Abstract

The first part of this work considers the entropy of the sum of (possibly dependent and non-identically distributed) Bernoulli
random variables. Upper bounds on the error that follows from an approximation of this entropy by the entropy of a Poisson
random variable with the same mean are derived via the Chen-Stein method. The second part of this work derives new lower
bounds on the total variation distance and relative entropy between the distribution of the sum of independent Bernoulli random
variables and the Poisson distribution. The starting point of the derivation of the new bounds in the second part of this work
is an introduction of a new lower bound on the total variation distance, whose derivation generalizes and refines the analysis
by Barbour and Hall (1984), based on the Chen-Stein method for the Poisson approximation. A new lower bound on the
relative entropy between these two distributions is introduced, and this lower bound is compared to a previously reported
upper bound on the relative entropy by Kontoyiannis et al. (2005). The derivation of the new lower bound on the relative
entropy follows from the new lower bound on the total variation distance, combined with a distribution-dependent refinement of
Pinsker's inequality by Ordentlich and Weinberger (2005). Upper and lower bounds on the Bhattacharyya parameter, Chernoff
information and Hellinger distance between the distribution of the sum of independent Bernoulli random variables and the
Poisson distribution with the same mean are derived as well via some relations between these quantities with the total variation
distance and the relative entropy. The analysis in this work combines elements of information theory with the Chen-Stein
method for the Poisson approximation. The resulting bounds are easy to compute, and their applicability is exemplified.
I. INTRODUCTION

Convergence to the Poisson distribution, for the number of occurrences of possibly dependent events, naturally
arises in various applications. Following the work of Poisson, there has been considerable interest in how well the
Poisson distribution approximates the binomial distribution. This approximation was treated by a limit theorem in
[13, Chapter 8], and later some non-asymptotic results have considered the accuracy of this approximation. Among
these old and interesting results, Le Cam's inequality [35] provides an upper bound on the total variation distance
between the distribution of the sum $S_n = \sum_{i=1}^n X_i$ of $n$ independent Bernoulli random variables $\{X_i\}_{i=1}^n$, where
$X_i \sim \mathrm{Bern}(p_i)$, and a Poisson distribution $\mathrm{Po}(\lambda)$ with mean $\lambda = \sum_{i=1}^n p_i$. This inequality states that

\[ d_{\mathrm{TV}}\bigl(P_{S_n}, \mathrm{Po}(\lambda)\bigr) \triangleq \frac{1}{2} \sum_{k=0}^{\infty} \left| P(S_n = k) - \frac{e^{-\lambda} \lambda^k}{k!} \right| \leq \sum_{i=1}^n p_i^2 \]



so if, e.g., $X_i \sim \mathrm{Bern}(\lambda/n)$ for every $i \in \{1, \ldots, n\}$ (referring to the case that $S_n$ is binomially distributed) then this
upper bound is equal to $\lambda^2/n$, thus decaying to zero as $n$ tends to infinity. This upper bound was later improved, e.g.,
by Barbour and Hall (see [4, Theorem 1]), replacing the above upper bound by $\left(\frac{1 - e^{-\lambda}}{\lambda}\right) \sum_{i=1}^n p_i^2$ and therefore
improving it by a factor of $1/\lambda$ when $\lambda$ is large. This improved upper bound was also proved by Barbour and Hall
to be essentially tight (see [4, Theorem 2]) with the following lower bound on the total variation distance:

\[ d_{\mathrm{TV}}\bigl(P_{S_n}, \mathrm{Po}(\lambda)\bigr) \geq \frac{1}{32} \min\left\{1, \frac{1}{\lambda}\right\} \sum_{i=1}^n p_i^2 \]
so the upper and lower bounds on the total variation distance differ by a factor of at most 32, irrespective of the
value of $\lambda$ (it is noted that in [5, Remark 3.2.2], the factor $\frac{1}{32}$ in the lower bound was claimed to be improvable,
with no explicit proof). The Poisson approximation and later also the compound Poisson approximation have
been extensively treated in the literature (see, e.g., the reference list in Q and this paper). 
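As a small numerical illustration of the bounds quoted above, the following sketch computes the exact total variation distance between the law of a sum of independent Bernoulli variables and the Poisson law with the same mean, and compares it with Le Cam's bound and the Barbour-Hall upper and lower bounds. The function names and the test parameters are illustrative choices, not part of the paper.

```python
import math


def bernoulli_sum_pmf(ps):
    """PMF of a sum of independent Bernoulli(p_i) variables, by iterated convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # X_i = 0
            nxt[k + 1] += mass * p       # X_i = 1
        pmf = nxt
    return pmf


def poisson_pmf(lam, k):
    """Poisson(lam) probability of k, evaluated in the log domain."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def tv_to_poisson(ps):
    """Exact d_TV between the law of sum(X_i) and Po(sum(p_i))."""
    lam = sum(ps)
    n = len(ps)
    pmf = bernoulli_sum_pmf(ps)
    inside = sum(abs(pmf[k] - poisson_pmf(lam, k)) for k in range(n + 1))
    # Poisson mass above n, where the Bernoulli sum has no mass.
    tail = max(1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1)), 0.0)
    return 0.5 * (inside + tail)


def tv_bounds(ps):
    """Le Cam upper bound and Barbour-Hall upper/lower bounds as quoted in the text."""
    lam = sum(ps)
    s2 = sum(p * p for p in ps)
    return {
        "lower_barbour_hall": min(1.0, 1.0 / lam) * s2 / 32.0,
        "upper_barbour_hall": (1.0 - math.exp(-lam)) / lam * s2,
        "upper_le_cam": s2,
    }
```

For instance, `tv_to_poisson([0.1] * 20)` can be compared against `tv_bounds([0.1] * 20)` to see how much the Barbour-Hall upper bound tightens Le Cam's inequality for a binomial law with mean 2.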

Among modern methods, the Chen-Stein method forms a powerful probabilistic tool that is used to calculate error
bounds when the Poisson approximation serves to assess the distribution of a sum of (possibly dependent) Bernoulli
random variables [10]. This method is based on the simple property of the Poisson distribution where $Z \sim \mathrm{Po}(\lambda)$
with $\lambda \in (0, \infty)$ if and only if $\lambda \, E[f(Z+1)] - E[Z f(Z)] = 0$ for all bounded functions $f$ that are defined on
$\mathbb{N}_0 \triangleq \{0, 1, \ldots\}$. This method provides a rigorous analytical treatment, via error bounds, to the case where $W$ has
approximately the Poisson distribution $\mathrm{Po}(\lambda)$ where it can be expected that $\lambda \, E[f(W+1)] - E[W f(W)] \approx 0$ for
an arbitrary bounded function $f$ that is defined on $\mathbb{N}_0$. The interested reader is referred to several comprehensive
surveys on the Chen-Stein method in (3j, @, Chapter 2], 0, 01 Chapter 2] and l43l . 
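The characterization of the Poisson distribution that underlies the Chen-Stein method can be checked numerically. The sketch below evaluates the quantity $\lambda E[f(W+1)] - E[W f(W)]$ for a (truncated) Poisson law, where it vanishes up to rounding, and for a binomial law with the same mean, where it is small but non-zero. The test function $f(k) = 1/(1+k)$ and the truncation point are arbitrary choices made for this illustration.

```python
import math


def poisson_pmf_list(lam, kmax):
    """Poisson(lam) probabilities for k = 0, ..., kmax (the tail beyond kmax is dropped)."""
    return [math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
            for k in range(kmax + 1)]


def binomial_pmf_list(n, p):
    """Binomial(n, p) probabilities for k = 0, ..., n."""
    return [math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]


def stein_gap(pmf, lam, f):
    """lam * E[f(W + 1)] - E[W f(W)] for W distributed according to pmf."""
    return sum(mass * (lam * f(k + 1) - k * f(k)) for k, mass in enumerate(pmf))


def bounded_test_function(k):
    """An arbitrary bounded function on the non-negative integers."""
    return 1.0 / (1.0 + k)
```

With $\lambda = 2$, `stein_gap(poisson_pmf_list(2.0, 60), 2.0, bounded_test_function)` is zero up to floating-point error, whereas the same call with `binomial_pmf_list(20, 0.1)` returns a value of the order of $\sum_i p_i^2$.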

During the last decade, information-theoretic methods were exploited to establish convergence to Poisson and 
compound Poisson limits in suitable paradigms. An information-theoretic study of the convergence rate of the 
binomial-to-Poisson distribution, in terms of the relative entropy between the binomial and Poisson distributions, 
was provided in [19], and maximum entropy results for the binomial, Poisson and compound Poisson distributions
were studied in HH, (HI, EH, O, El, lEQ and [52]]. The law of small numbers refers to the phenomenon that, 
for random variables $\{X_i\}_{i=1}^n$ defined on $\mathbb{N}_0$, the sum $\sum_{i=1}^n X_i$ is approximately Poisson distributed with mean
$\lambda = \sum_{i=1}^n p_i$ if (qualitatively) the following conditions hold: $P(X_i = 0)$ is close to 1, $P(X_i = 1)$ is uniformly
small, $P(X_i > 1)$ is negligible as compared to $P(X_i = 1)$, and $\{X_i\}_{i=1}^n$ are weakly dependent (see ifTTl , H4l and
B31 ). An information-theoretic study of the law of small numbers was provided in [33 ] via the derivation of upper 
bounds on the relative entropy between the distribution of the sum of possibly dependent Bernoulli random variables 
and the Poisson distribution with the same mean. An extension of the law of small numbers to a thinning limit 
theorem for convolutions of discrete distributions that are defined on $\mathbb{N}_0$ was introduced in [22], followed by an
analysis of the convergence rate and some non-asymptotic results. Further work in this direction was studied in [30],
and the work in [7] provides an information-theoretic study for the problem of compound Poisson approximation,
which parallels the earlier study for the Poisson approximation in [33]. A recent follow-up to the works in [7] and
[33] is provided in [36] and [37], considering connections between Stein characterizations and Fisher information
functionals. Nice surveys on the line of work on information-theoretic aspects of the Poisson approximation are
introduced in [28, Chapter 7] and [34]. Furthermore, [13, Chapter 2] surveys some commonly-used metrics between
probability measures with some pointers to the Poisson approximation.

This paper provides an information-theoretic study of the Poisson approximation via the Chen-Stein method. The 
novelty of this paper is considered to be in the following aspects: 

• Consider the entropy of a sum of (possibly dependent and non-identically distributed) Bernoulli random
variables. Upper bounds on the error that follows from an approximation of this entropy by the entropy
of a Poisson random variable with the same mean are derived via the Chen-Stein method (see Theorem 5 and
its related results in Section II). The use of these new bounds is exemplified for some interesting applications
of the Chen-Stein method in [2] and [3].

• Improved lower bounds on the relative entropy between the distribution of a sum of independent Bernoulli
random variables and the Poisson distribution with the same mean are derived (see Theorem 7 in Section III).
These new bounds are obtained by combining a derivation of some sharpened lower bounds on the total
variation distance (see Theorem 6 and some related results in Section III) that improve the original lower
bound in [4, Theorem 2], and a probability-dependent refinement of Pinsker's inequality [38]. The new lower
bounds are compared with existing upper bounds.

• New upper and lower bounds on the Chernoff information and Bhattacharyya parameter are also derived in
Section III via the introduction of new bounds on the Hellinger distance and relative entropy. The use of the
new lower bounds on the relative entropy and Chernoff information is exemplified in the context of binary
hypothesis testing. The impact of the improvements of these new bounds is studied as well.

To the best of our knowledge, among the publications of the IEEE Trans. on Information Theory, the Chen-Stein
method for Poisson approximation has been used so far on only two occasions. In [49], this probabilistic method was
used by A. J. Wyner to analyze the redundancy and the distribution of the phrase lengths in one of the versions of
the Lempel-Ziv data compression algorithm. On the second occasion, this method was applied in [16] in the context
of random networks. In [16], the authors relied on existing upper bounds on the total variation distance, applying
them to analyze the asymptotic distribution of the number of isolated nodes in a random grid network where nodes
are always active. The first part of this paper relies (as well) on some existing upper bounds on the total variation
distance, with the purpose of obtaining error bounds on the Poisson approximation of the entropy for a sum of
(possibly dependent) Bernoulli random variables, or more generally for a sum of non-negative, integer-valued and
bounded random variables (this work relies on stronger versions of the upper bounds in [16, Theorems 2.2 and 2.4]).

The paper is structured as follows: Section II forms the first part of this work where the entropy of the sum of
Bernoulli random variables is considered. Section III provides the second part of this work where new lower bounds
on the total variation distance and relative entropy between the distribution of the sum of independent Bernoulli
random variables and the Poisson distribution are derived. The derivation of the new and improved lower bounds
on the total variation distance relies on the Chen-Stein method for the Poisson approximation, and it generalizes
and tightens the analysis that was used to derive the original lower bound on the total variation distance in [4]. The
derivation of the new lower bound on the relative entropy follows from the new lower bounds on the total variation
distance, combined with a distribution-dependent refinement of Pinsker's inequality in [38]. The new lower bound
on the relative entropy is compared to a previously reported upper bound on the relative entropy from [33]. Upper
and lower bounds on the Bhattacharyya parameter, Chernoff information and the Hellinger, local and
Kolmogorov-Smirnov distances between the distribution of the sum of independent Bernoulli random variables and the Poisson
distribution with the same mean are also derived in Section III via some relations between these quantities with
the total variation distance and the relative entropy. The analysis in this work combines elements of information
theory with the Chen-Stein method for Poisson approximation. The use of these new bounds is exemplified in the
two parts of this work, partially relying on some interesting applications of the Chen-Stein method for the Poisson
approximation that were introduced in [2] and [3]. The bounds that are derived in this work are easy to compute,
and their applicability is exemplified. Throughout the paper, logarithms are taken to the natural base $e$, so
entropies are expressed in nats.

II. Error Bounds on the Entropy of the Sum of Bernoulli Random Variables 

This section considers the entropy of a sum of (possibly dependent and non-identically distributed) Bernoulli
random variables. Section II-A provides a review of some reported results on the Poisson approximation, whose
derivation relies on the Chen-Stein method, that are relevant to the analysis in this section. The original results of
this section start in Section II-B, which provides an upper bound on the entropy difference between two
discrete random variables in terms of their total variation distance. This bound is later used in this section in the context
of the Poisson approximation. Section II-C introduces some explicit upper bounds on the error that follows from the
approximation of the entropy of a sum of Bernoulli random variables by the entropy of a Poisson random variable
with the same mean. Some applications of the new bounds are exemplified in Section II-D, and these bounds are
proved in Section II-E. Finally, a generalization of these bounds is introduced in Section II-F to address the case of
the Poisson approximation for the entropy of a sum of non-negative, integer-valued and bounded random variables.

A. Review of Some Essential Results for the Analysis in Section II

Throughout the paper, we use the term 'distribution' to refer to the discrete probability mass function of an 
integer-valued random variable. In the following, we review briefly some known results that are used for the 
analysis later in this section. 

Definition 1: Let $P$ and $Q$ be two probability measures defined on a set $\mathcal{X}$. Then, the total variation distance
between $P$ and $Q$ is defined by

\[ d_{\mathrm{TV}}(P, Q) \triangleq \sup_{\text{Borel } A \subseteq \mathcal{X}} |P(A) - Q(A)| \tag{1} \]

where the supremum is taken w.r.t. all the Borel subsets $A$ of $\mathcal{X}$. If $\mathcal{X}$ is a countable set then (1) is simplified to

\[ d_{\mathrm{TV}}(P, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |P(x) - Q(x)| = \frac{\|P - Q\|_1}{2} \tag{2} \]

so the total variation distance is equal to one-half of the $\ell_1$-distance between the two probability distributions.
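The equivalence of (1) and (2) on a countable set can be illustrated by brute force on a small alphabet. The sketch below is only an illustration with made-up probability vectors: it maximizes $|P(A) - Q(A)|$ over all subsets $A$ and compares the result with one-half of the $\ell_1$-distance.

```python
from itertools import combinations


def tv_half_l1(p, q):
    """Total variation distance as one-half of the l1-distance, as in (2)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_sup_over_subsets(p, q):
    """Total variation distance as a maximum of |P(A) - Q(A)| over all subsets A, as in (1)."""
    n = len(p)
    best = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            diff = abs(sum(p[i] for i in subset) - sum(q[i] for i in subset))
            best = max(best, diff)
    return best
```

The exhaustive search is exponential in the alphabet size, so it is only meant as a sanity check; the maximizing set is $A = \{x : P(x) > Q(x)\}$ or its complement.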
The following theorem combines [4, Theorems 1 and 2], and its proof relies on the Chen-Stein method:
Theorem 1: Let $W = \sum_{i=1}^n X_i$ be a sum of $n$ independent Bernoulli random variables with $E(X_i) = p_i$ for
$i \in \{1, \ldots, n\}$, and $E(W) = \lambda$. Then, the total variation distance between the probability distribution of $W$ and
the Poisson distribution with mean $\lambda$ satisfies

\[ \frac{1}{32} \left(1 \wedge \frac{1}{\lambda}\right) \sum_{i=1}^n p_i^2 \leq d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \leq \left(\frac{1 - e^{-\lambda}}{\lambda}\right) \sum_{i=1}^n p_i^2 \tag{3} \]

where $a \wedge b \triangleq \min\{a, b\}$ for every $a, b \in \mathbb{R}$.

Remark 1: The ratio between the upper and lower bounds in Theorem 1 is not larger than 32, irrespective of the
values of $\{p_i\}$. This shows that, for independent Bernoulli random variables, these bounds are essentially tight. The
upper bound in (3) improves Le Cam's inequality (see [35], [47]) which states that $d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \leq \sum_{i=1}^n p_i^2$,
so the improvement, for large values of $\lambda$, is approximately by the factor $\frac{1}{\lambda}$.

Theorem 1 provides a non-asymptotic result for the Poisson approximation of sums of independent binary random
variables via the use of the Chen-Stein method. In general, this method makes it possible to analyze the Poisson approximation
for sums of dependent random variables. To this end, the following notation was used in [2] and [3]:

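Let $I$ be a countable index set, and for $\alpha \in I$, let $X_\alpha$ be a Bernoulli random variable with

\[ p_\alpha \triangleq P(X_\alpha = 1) = 1 - P(X_\alpha = 0) > 0. \tag{4} \]

Let

\[ W \triangleq \sum_{\alpha \in I} X_\alpha, \qquad \lambda \triangleq E(W) = \sum_{\alpha \in I} p_\alpha \tag{5} \]

where it is assumed that $\lambda \in (0, \infty)$. For every $\alpha \in I$, let $B_\alpha$ be a subset of $I$ that is chosen such that $\alpha \in B_\alpha$. This
subset is interpreted in [2] as the neighborhood of dependence for $\alpha$ in the sense that $X_\alpha$ is independent or weakly
dependent of all of the $X_\beta$ for $\beta \notin B_\alpha$. Furthermore, the following coefficients were defined in [2, Section 2]:

\[ b_1 \triangleq \sum_{\alpha \in I} \sum_{\beta \in B_\alpha} p_\alpha p_\beta \tag{6} \]

\[ b_2 \triangleq \sum_{\alpha \in I} \sum_{\beta \in B_\alpha \setminus \{\alpha\}} p_{\alpha,\beta}, \qquad p_{\alpha,\beta} \triangleq E(X_\alpha X_\beta) \tag{7} \]

\[ b_3 \triangleq \sum_{\alpha \in I} s_\alpha, \qquad s_\alpha \triangleq E \left| E\bigl(X_\alpha - p_\alpha \,\big|\, \sigma(\{X_\beta\}_{\beta \in I \setminus B_\alpha})\bigr) \right| \tag{8} \]

where $\sigma(\cdot)$ in the conditioning of (8) denotes the $\sigma$-algebra that is generated by the random variables inside the
parenthesis. In the following, we cite [2, Theorem 1] which essentially implies that when $b_1$, $b_2$ and $b_3$ are all small,
then the total number $W$ of events is approximately Poisson distributed.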

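Theorem 2: Let $W = \sum_{\alpha \in I} X_\alpha$ be a sum of (possibly dependent and non-identically distributed) Bernoulli
random variables $\{X_\alpha\}_{\alpha \in I}$. Then, with the notation in (4)-(8), the following upper bound on the total variation
distance holds:

\[ d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \leq (b_1 + b_2) \left(\frac{1 - e^{-\lambda}}{\lambda}\right) + b_3 \left(1 \wedge \frac{1.4}{\sqrt{\lambda}}\right). \tag{9} \]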

Remark 2: A comparison of the right-hand side of (9) with the bound in [2, Theorem 1] shows a difference in a
factor of 2 between the two upper bounds. This follows from a difference in a factor of 2 between the two definitions
of the total variation distance in [2, Section 2] and Definition 1 here. It is noted, however, that Definition 1 in this
work is consistent, e.g., with [4] and [5].

Remark 3: Theorem 2 forms a generalization of the upper bound in Theorem 1 by choosing $B_\alpha = \{\alpha\}$ for
$\alpha \in I \triangleq \{1, \ldots, n\}$ (note that, due to the independence assumption of the Bernoulli random variables in Theorem 1,
the neighborhood of dependence of $\alpha$ is $\alpha$ itself). In this setting, under the independence assumption,

\[ b_1 = \sum_{i=1}^n p_i^2, \qquad b_2 = b_3 = 0 \]

which therefore gives, from (9), the upper bound on the right-hand side of (3).
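The reduction in Remark 3 can be made concrete with a few lines of code. The sketch below evaluates the right-hand side of the Chen-Stein bound from the three coefficients and the mean, and then plugs in the coefficients of the independent case. The function names are illustrative, and the constant $1.4/\sqrt{\lambda}$ multiplying $b_3$ is taken here as an assumption following the statement of [2, Theorem 1]; it plays no role when $b_3 = 0$.

```python
import math


def chen_stein_tv_bound(b1, b2, b3, lam):
    """Right-hand side of the Chen-Stein upper bound on the total variation distance.

    The factor min(1, 1.4 / sqrt(lam)) on b3 is an assumption of this sketch.
    """
    return (b1 + b2) * (1.0 - math.exp(-lam)) / lam + b3 * min(1.0, 1.4 / math.sqrt(lam))


def independent_coefficients(ps):
    """Coefficients (b1, b2, b3) for independent summands with B_alpha = {alpha}."""
    return sum(p * p for p in ps), 0.0, 0.0
```

For independent summands the output coincides with the upper bound of Theorem 1, namely $\frac{1 - e^{-\lambda}}{\lambda} \sum_i p_i^2$.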
Before proceeding to this analysis, the following maximum entropy result of the Poisson distribution is introduced.
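Theorem 3: The Poisson distribution $\mathrm{Po}(\lambda)$ has the maximal entropy among all probability distributions with
mean $\lambda$ that can be obtained as sums of independent Bernoulli RVs:

\[ H(\mathrm{Po}(\lambda)) = \sup_{S \in B_\infty(\lambda)} H(S) \]

\[ B_\infty(\lambda) \triangleq \bigcup_{n \in \mathbb{N}} B_n(\lambda) \]

\[ B_n(\lambda) \triangleq \left\{ S : \; S = \sum_{i=1}^n X_i, \; X_i \sim \mathrm{Bern}(p_i) \text{ independent}, \; \sum_{i=1}^n p_i = \lambda \right\}. \tag{10} \]

Furthermore, since the supremum of the entropy over the set $B_n(\lambda)$ is monotonically increasing in $n$, then

\[ H(\mathrm{Po}(\lambda)) = \lim_{n \to \infty} \sup_{S \in B_n(\lambda)} H(S). \]

For $n \in \mathbb{N}$, the maximum entropy distribution in the class $B_n(\lambda)$ is the binomial distribution of the sum of $n$ i.i.d.
Bernoulli random variables $\mathrm{Bern}\left(\frac{\lambda}{n}\right)$, so

\[ H(\mathrm{Po}(\lambda)) = \lim_{n \to \infty} H\left(\mathrm{Binomial}\left(n, \frac{\lambda}{n}\right)\right). \]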
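The monotone convergence stated in Theorem 3 is easy to observe numerically. The following sketch (with illustrative function names and an arbitrary truncation point for the Poisson law) computes the entropy, in nats, of $\mathrm{Binomial}(n, \lambda/n)$ for increasing $n$ and of $\mathrm{Po}(\lambda)$.

```python
import math


def entropy(pmf):
    """Shannon entropy in nats of a probability mass function given as a list."""
    return -sum(p * math.log(p) for p in pmf if p > 0.0)


def binomial_pmf(n, p):
    """Binomial(n, p) probabilities, evaluated in the log domain (0 < p < 1)."""
    return [math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                     + k * math.log(p) + (n - k) * math.log1p(-p))
            for k in range(n + 1)]


def poisson_pmf(lam, kmax):
    """Poisson(lam) probabilities for k = 0, ..., kmax (truncated support)."""
    return [math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
            for k in range(kmax + 1)]


def binomial_entropies(lam, ns):
    """Entropies of Binomial(n, lam / n) for each n in ns (each n must exceed lam)."""
    return [entropy(binomial_pmf(n, lam / n)) for n in ns]
```

For example, `binomial_entropies(2.0, [10, 100, 1000])` yields an increasing sequence that approaches `entropy(poisson_pmf(2.0, 100))` from below.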



Remark 4: Theorem 3 partially appears in [32, Proposition 2.1] (see [32, Eq. (2.20)]). This theorem follows
directly from [18, Theorems 7 and 8].

Remark 5: The maximum entropy result for the Poisson distribution in Theorem 3 was strengthened in [28] by
showing that the supremum on the right-hand side of (10) can be extended to the larger set of ultra-log-concave
probability mass functions (that includes the binomial distribution). This result for the Poisson distribution was
generalized in [29] and [31] to maximum entropy results for discrete compound Poisson distributions.

Calculation of the entropy of a Poisson random variable: In the next sub-section, we consider the approximation
of the entropy of a sum of Bernoulli random variables by the entropy of a Poisson random variable with the same
mean. To this end, it is required to evaluate the entropy of $Z \sim \mathrm{Po}(\lambda)$. It is straightforward to verify that

\[ H(Z) = \lambda \log\left(\frac{e}{\lambda}\right) + \sum_{k=1}^{\infty} \frac{\lambda^k e^{-\lambda} \log k!}{k!} \tag{11} \]

so the entropy of the Poisson distribution (in nats) is given in terms of an infinite series that has no closed-form
expression. Sequences of simple upper and lower bounds on this entropy, which are asymptotically tight, were
derived in [1]. In particular, from [1, Theorem 2],

\[ -\frac{31}{24\lambda^2} - \frac{33}{20\lambda^3} - \frac{1}{20\lambda^4} \leq H(Z) - \frac{1}{2} \log(2\pi e \lambda) + \frac{1}{12\lambda} \leq \frac{5}{24\lambda^2} + \frac{1}{60\lambda^3} \tag{12} \]

which gives tight bounds on the entropy of $Z \sim \mathrm{Po}(\lambda)$ for large values of $\lambda$. For $\lambda \geq 20$, the entropy of $Z$
is approximated by the average of its upper and lower bounds in (12), asserting that the relative error of this
approximation is less than 0.1% (and it decreases as the value of $\lambda$ is increased). For $\lambda \in (0, 20)$,
a truncation of the infinite series on the right-hand side of (11) after its first $\lceil 10\lambda \rceil$ terms gives an accurate
approximation.
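The two numerical recipes described above can be sketched as follows. The first function evaluates the series (11) truncated after $\lceil 10\lambda \rceil$ terms; the second evaluates the leading terms $\frac{1}{2}\log(2\pi e\lambda) - \frac{1}{12\lambda}$ that appear in the middle of (12), leaving out the $O(\lambda^{-2})$ corrections. Function names and the truncation of the direct sum are choices made for this illustration only.

```python
import math


def poisson_entropy_series(lam, terms=None):
    """Entropy of Po(lam) in nats via the series (11), truncated after `terms` terms."""
    if terms is None:
        terms = math.ceil(10 * lam)
    total = lam * math.log(math.e / lam)
    for k in range(1, terms + 1):
        log_factorial = math.lgamma(k + 1)
        # Poisson probability of k times log(k!), with the probability in the log domain.
        total += math.exp(-lam + k * math.log(lam) - log_factorial) * log_factorial
    return total


def poisson_entropy_direct(lam, kmax):
    """Entropy of Po(lam) from the definition -sum p log p, truncated at kmax."""
    total = 0.0
    for k in range(kmax + 1):
        log_p = -lam + k * math.log(lam) - math.lgamma(k + 1)
        total -= math.exp(log_p) * log_p
    return total


def poisson_entropy_leading_terms(lam):
    """Large-lam approximation (1/2) log(2 pi e lam) - 1/(12 lam)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * lam) - 1.0 / (12.0 * lam)
```

For moderate means the truncated series agrees with the direct definition, and for large means it agrees with the leading-term approximation up to a correction of order $\lambda^{-2}$.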



B. A New Bound on the Entropy Difference of Two Discrete Random Variables 

The following theorem provides a new upper bound on the entropy difference between two discrete random
variables in terms of their total variation distance. This theorem relies on the bound of Ho and Yeung in [23,
Theorem 6] that forms an improvement over the previously reported bound in [11, Theorem 17.3.3] or [12,
Lemma 2.7]. The following new bound is later used in this section in the context of the Poisson approximation.
Theorem 4: Let $\mathcal{A} = \{a_1, a_2, \ldots\}$ be a countably infinite set. Let $X$ and $Y$ be two discrete random variables
where $X$ takes values from a finite set $\mathcal{X} = \{a_1, \ldots, a_m\}$, for some $m \in \mathbb{N}$, and $Y$ takes values from the entire
set $\mathcal{A}$. Assume that

\[ d_{\mathrm{TV}}(X, Y) \leq \eta \tag{13} \]

for some $\eta \in [0, 1)$, and let

\[ M \triangleq \max\left\{ m + 1, \; \frac{1}{1 - \eta} \right\}. \tag{14} \]

Furthermore, let $\mu > 0$ be set such that

\[ -\sum_{i=M}^{\infty} P_Y(a_i) \log P_Y(a_i) \leq \mu \tag{15} \]

then

\[ |H(X) - H(Y)| \leq \eta \log(M - 1) + h(\eta) + \mu \tag{16} \]

where $h$ denotes the binary entropy function.

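Proof: Let $\tilde{Y}$ be a random variable that is defined to be equal to $Y$ if $Y \in \{a_1, \ldots, a_{M-1}\}$, and it is set to
be equal to $a_M$ if $Y = a_i$ for some $i \geq M$. Hence, the probability mass function of $\tilde{Y}$ is related to that of $Y$ as
follows

\[ P_{\tilde{Y}}(a_i) = \begin{cases} P_Y(a_i) & \text{if } i \in \{1, \ldots, M-1\} \\ \sum_{j=M}^{\infty} P_Y(a_j) & \text{if } i = M. \end{cases} \tag{17} \]

Since $P_X(a_i) = 0$ for every $i > m$ and $M \geq m + 1$, then it follows from (17) that

\begin{align*}
d_{\mathrm{TV}}(X, \tilde{Y})
&= \frac{1}{2} \sum_{i=1}^{m} |P_X(a_i) - P_{\tilde{Y}}(a_i)| + \frac{1}{2} \sum_{i=m+1}^{M-1} P_{\tilde{Y}}(a_i) + \frac{1}{2} P_{\tilde{Y}}(a_M) \\
&= \frac{1}{2} \sum_{i=1}^{m} |P_X(a_i) - P_Y(a_i)| + \frac{1}{2} \sum_{i=m+1}^{\infty} P_Y(a_i) \\
&= d_{\mathrm{TV}}(X, Y). \tag{18}
\end{align*}

Hence, $X$ and $\tilde{Y}$ are two discrete random variables that take values from the set $\{a_1, \ldots, a_M\}$ (note that it includes
the set $\mathcal{X}$) and $d_{\mathrm{TV}}(X, \tilde{Y}) \leq \eta$ (see (13) and (18)). The bound in [23, Theorem 6] therefore implies that if $\eta \leq 1 - \frac{1}{M}$
(which is indeed the case, due to the way $M$ is defined in (14)), then

\[ |H(X) - H(\tilde{Y})| \leq \eta \log(M - 1) + h(\eta). \tag{19} \]

Since $\tilde{Y}$ is a deterministic function of $Y$ then $H(Y) = H(Y, \tilde{Y}) \geq H(\tilde{Y})$, and therefore (15) and (17) imply that

\begin{align*}
|H(Y) - H(\tilde{Y})|
&= H(Y) - H(\tilde{Y}) \\
&= -\sum_{i=M}^{\infty} P_Y(a_i) \log P_Y(a_i) + \left( \sum_{i=M}^{\infty} P_Y(a_i) \right) \log \left( \sum_{i=M}^{\infty} P_Y(a_i) \right) \\
&\leq -\sum_{i=M}^{\infty} P_Y(a_i) \log P_Y(a_i) \\
&\leq \mu. \tag{20}
\end{align*}

Finally, the bound in (16) follows from (19), (20) and the triangle inequality. ■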
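To see how the ingredients of Theorem 4 fit together, the following sketch evaluates the right-hand side of (16) and checks it on one concrete pair: $X \sim \mathrm{Binomial}(20, 0.1)$, supported on $m = 21$ points, and $Y \sim \mathrm{Po}(2)$. Several choices here are assumptions of the illustration rather than part of the theorem: $\eta$ is set to the exact total variation distance, $M$ is rounded up to an integer, the Poisson support is truncated at a large index, and all function names are hypothetical.

```python
import math


def binary_entropy(x):
    """Binary entropy function h(x) in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def entropy_difference_bound(eta, m, mu):
    """Right-hand side of (16) and the value of M; M is rounded up to an integer here."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    big_m = max(m + 1, math.ceil(1.0 / (1.0 - eta)))
    return eta * math.log(big_m - 1) + binary_entropy(eta) + mu, big_m


def demo(n=20, p=0.1, kmax=400):
    """Return (|H(X) - H(Y)|, bound) for X ~ Binomial(n, p) and Y ~ Po(n * p)."""
    lam = n * p
    px = [math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]
    log_py = [-lam + k * math.log(lam) - math.lgamma(k + 1) for k in range(kmax + 1)]
    py = [math.exp(v) for v in log_py]
    h_x = -sum(q * math.log(q) for q in px if q > 0.0)
    h_y = -sum(q * lq for q, lq in zip(py, log_py))
    # Exact total variation distance; X has no mass above n.
    eta = 0.5 * (sum(abs(px[k] - py[k]) for k in range(n + 1)) + sum(py[n + 1:]))
    _, big_m = entropy_difference_bound(eta, n + 1, 0.0)
    # The symbol a_i corresponds to the integer i - 1, so indices i >= M mean k >= M - 1.
    mu = -sum(q * lq for q, lq in zip(py[big_m - 1:], log_py[big_m - 1:]))
    bound, _ = entropy_difference_bound(eta, n + 1, mu)
    return abs(h_x - h_y), bound
```

Calling `demo()` returns the actual entropy gap together with the bound of (16); the gap should not exceed the bound.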
C. New Error Bounds on the Entropy of Sums of Bernoulli Random Variables

The new bounds on the entropy of sums of Bernoulli random variables are introduced in the following. Their
use is exemplified in Section II-D, and their proofs appear in Section II-E.

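Theorem 5: Let $I$ be an arbitrary finite index set with $|I| \triangleq n$. Under the assumptions of Theorem 2 and the
notation used in Eqs. (4)-(8), let

\[ \eta \triangleq (b_1 + b_2) \left(\frac{1 - e^{-\lambda}}{\lambda}\right) + b_3 \left(1 \wedge \frac{1.4}{\sqrt{\lambda}}\right) \tag{21} \]

\[ M \triangleq \max\left\{ n + 2, \; \frac{1}{1 - \eta} \right\} \tag{22} \]

\[ \mu \triangleq \left[ \left(\lambda \log\left(\frac{e}{\lambda}\right)\right)_{+} + \lambda^2 + \frac{6 \log(2\pi) + 1}{12} \right] \exp\left\{ -\left[ \lambda + (M - 2) \log\left(\frac{M - 2}{\lambda e}\right) \right] \right\} \tag{23} \]

where, in (23), $(x)_{+} \triangleq \max\{x, 0\}$ for every $x \in \mathbb{R}$. Let $Z \sim \mathrm{Po}(\lambda)$ be a Poisson random variable with mean $\lambda$. If
$\eta < 1$, then the difference between the entropies of $Z$ and $W$ satisfies the following inequality:

\[ |H(Z) - H(W)| \leq \eta \log(M - 1) + h(\eta) + \mu. \tag{24} \]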

The following corollary refers to the entropy of a sum of independent Bernoulli random variables:
Corollary 1: Consider the setting in Theorem 5, and assume that the Bernoulli random variables $\{X_\alpha\}_{\alpha \in I}$ are
also independent. Then, the following inequality holds:

\[ 0 \leq H(Z) - H(W) \leq \eta \log(M - 1) + h(\eta) + \mu \tag{25} \]

where $\eta$ in (21) is specialized to

\[ \eta \triangleq \left(\frac{1 - e^{-\lambda}}{\lambda}\right) \sum_{\alpha \in I} p_\alpha^2. \tag{26} \]

The following bound forms a possible improvement of the result in Corollary 1.

Proposition 1: Assume that the conditions in Corollary 1 are satisfied. Then, inequality (25) holds with the new
parameter

\[ \eta \triangleq \min\left\{ 1 - e^{-\lambda}, \; \frac{3}{4e \, (1 - \sqrt{\theta})^{3/2}} \right\} \theta \tag{27} \]

where

\[ \theta \triangleq \frac{1}{\lambda} \sum_{\alpha \in I} p_\alpha^2. \tag{29} \]

Remark 6: From (28) and (29), it follows that $0 \leq \theta \leq \max_{\alpha \in I} p_\alpha \triangleq p_{\max}$. The condition that $\eta < 1$ is mild
since it is a meaningful upper bound on the total variation distance (which is bounded by 1).

Remark 7: Proposition 1 improves the bound in Corollary 1 only if $\theta$ is below a certain value that depends on
$\lambda$. The maximal improvement that is obtained by Proposition 1 as compared to Corollary 1 is in the case where
$\theta \to 0$ and $\lambda \to \infty$, and the corresponding improvement in the value of $\eta$ is by a factor of $\frac{3}{4e} \approx 0.276$.
D. Applications of the New Error Bounds on the Entropy

In the following, the use of Theorem 5 is exemplified for the estimation of the entropy of sums of (possibly
dependent) Bernoulli random variables. It starts with a simple example where the summands are independent binary
random variables, and some interesting examples from [2, Section 3] and [3, Section 4] are considered next. These
examples are related to sums of dependent Bernoulli random variables, where the use of Theorem 5 is exemplified
for the calculation of error bounds on the entropy via the Chen-Stein method.

Example 1 (sums of independent Bernoulli random variables): Let $W = \sum_{i=1}^n X_i$ be a sum of $n$ independent
Bernoulli random variables where $X_i \sim \mathrm{Bern}(p_i)$ for $i = 1, \ldots, n$. The calculation of the entropy of $W$ involves
the numerical computation of the probabilities

\[ \bigl(P_W(0), P_W(1), \ldots, P_W(n)\bigr) = (1 - p_1, p_1) * (1 - p_2, p_2) * \cdots * (1 - p_n, p_n) \]

whose computational complexity is high for very large values of $n$, especially if the probabilities $p_1, \ldots, p_n$ are
not the same. The bounds in Corollary 1 and Proposition 1 provide rigorous upper bounds on the accuracy of the
Poisson approximation for $H(W)$. Let us exemplify this in the case where

\[ p_i = 2ai, \quad \forall \, i \in \{1, \ldots, n\}, \qquad a = 10^{-10}, \quad n = 10^8 \]

then

\[ \lambda = \sum_{i=1}^n p_i = an(n + 1) = 1{,}000{,}000.01 \approx 10^6 \]

and from (29)

\[ \theta = \frac{1}{\lambda} \sum_{i=1}^n p_i^2 = \frac{2a(2n + 1)}{3} = 0.0133. \]

The entropy of the Poisson random variable $Z \sim \mathrm{Po}(\lambda)$ is evaluated via the bounds in (12) (since $\lambda \gg 1$, these
bounds are tight), and they imply that $H(Z) = 8.327$ nats. From Corollary 1 (see Eq. (25) where $I = \{1, \ldots, n\}$),
it follows that $0 \leq H(Z) - H(W) \leq 0.316$ nats, and Proposition 1 improves it to $0 \leq H(Z) - H(W) \leq 0.110$ nats.
Hence, $H(W) \approx 8.272$ nats with a relative error of at most 0.7%.
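The numbers in Example 1 can be reproduced with a short computation. The sketch below uses the closed forms for $\lambda$ and $\theta$ given in the example and evaluates $\eta \log(M-1) + h(\eta)$ for the two choices of $\eta$, with $\eta = \frac{1 - e^{-\lambda}}{\lambda}\sum_i p_i^2$ for Corollary 1 and $\eta = \min\{1 - e^{-\lambda}, \frac{3}{4e(1-\sqrt{\theta})^{3/2}}\}\,\theta$ for Proposition 1. Two simplifications are assumptions of this sketch: the tail term $\mu$ of Theorem 5 is omitted (it is treated as numerically negligible for these parameters), and $M$ is rounded up to an integer. Function names are illustrative.

```python
import math


def binary_entropy(x):
    """Binary entropy function h(x) in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def example1_bounds(a=1e-10, n=10 ** 8):
    """Return (lam, theta, bound from Corollary 1, bound from Proposition 1)."""
    lam = a * n * (n + 1)                   # sum of p_i = 2 a i over i = 1..n
    theta = 2.0 * a * (2 * n + 1) / 3.0     # (1 / lam) * sum of p_i^2

    def bound(eta):
        big_m = max(n + 2, math.ceil(1.0 / (1.0 - eta)))
        # Tail term mu omitted: an assumption of this sketch.
        return eta * math.log(big_m - 1) + binary_entropy(eta)

    eta_corollary = (1.0 - math.exp(-lam)) * theta
    eta_proposition = min(1.0 - math.exp(-lam),
                          3.0 / (4.0 * math.e * (1.0 - math.sqrt(theta)) ** 1.5)) * theta
    return lam, theta, bound(eta_corollary), bound(eta_proposition)
```

With the default arguments the two bounds evaluate to roughly 0.316 and 0.110 nats, in agreement with the values quoted in the example.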

Example 2 (random graphs): This problem, which appears in [2, Example 1], is described as follows: On the
cube $\{0, 1\}^n$, assume that each of the $n 2^{n-1}$ edges is assigned a random direction by tossing a fair coin. Let
$k \in \{0, 1, \ldots, n\}$ be fixed, and denote by $W \triangleq W(k, n)$ the random variable that is equal to the number of vertices
at which exactly $k$ edges point outward (so $k = 0$ corresponds to the event where all $n$ edges, from a certain vertex,
point inward). Let $I$ be the set of all $2^n$ vertices, and $X_\alpha$ be the indicator that vertex $\alpha \in I$ has exactly $k$ of its
edges directed outward. Then $W = \sum_{\alpha \in I} X_\alpha$ with

\[ X_\alpha \sim \mathrm{Bern}(p), \qquad p = 2^{-n} \binom{n}{k}, \qquad \forall \, \alpha \in I. \]

This implies that $\lambda = \binom{n}{k}$ (since $|I| = 2^n$). Clearly, the neighborhood of dependence of a vertex $\alpha \in I$, denoted by
$B_\alpha$, is the set of vertices that are directly connected to $\alpha$ (including $\alpha$ itself since Theorem 2 requires that $\alpha \in B_\alpha$).
It is noted, however, that $B_\alpha$ in [2, Example 1] was given by $B_\alpha = \{\beta : |\beta - \alpha| = 1\}$ so it excluded the vertex
$\alpha$. From (6), this difference implies that $b_1$ in their example should be modified to

\[ b_1 = 2^n (n + 1) \left[ 2^{-n} \binom{n}{k} \right]^2 = 2^{-n} (n + 1) \binom{n}{k}^2 \tag{30} \]

so $b_1$ is larger than its value in [2, p. 14] by a factor of $1 + \frac{1}{n}$ which has a negligible effect if $n \gg 1$. As is noted
in [2, p. 14], if $\alpha$ and $\beta$ are two vertices that are connected by an edge, then a conditioning on the direction of
this edge gives that

\[ p_{\alpha,\beta} \triangleq E(X_\alpha X_\beta) = 2^{2 - 2n} \binom{n-1}{k} \binom{n-1}{k-1}, \qquad \forall \, \alpha \in I, \; \beta \in B_\alpha \setminus \{\alpha\} \]

and therefore, from (7),

\[ b_2 = n \, 2^{2 - n} \binom{n-1}{k} \binom{n-1}{k-1}. \]

Finally, as is noted in [2, Example 1], $b_3 = 0$ (this is because the conditional expectation of $X_\alpha$ given $(X_\beta)_{\beta \in I \setminus B_\alpha}$
is, similarly to the unconditional expectation, equal to $p_\alpha$; i.e., the directions of the edges outside the neighborhood
of dependence of $\alpha$ are irrelevant to the directions of the edges connecting the vertex $\alpha$).

In the following, Theorem 5 is applied to get a rigorous error bound on the Poisson approximation of the entropy
$H(W)$. Table I presents numerical results for the approximated value of $H(W)$, and the maximal relative error
that is associated with this approximation. Note that, by symmetry, the cases with $W(k, n)$ and $W(n - k, n)$ are
equivalent, so $H(W(k, n)) = H(W(n - k, n))$.

TABLE I

Numerical results for the Poisson approximations of the entropy H(W) (W = W(k, n)) by the entropy H(Z) where
Z ~ Po(λ), jointly with the associated error bounds of these approximations. These error bounds are calculated
from Theorem 5 for the random graph problem in Example 2.

n      k (or n − k)    λ = C(n, k)       Approximation of H(W)    Maximal relative error
30     27              4.060 · 10^3      5.573 nats               0.1%
30     26              2.741 · 10^4      6.528 nats               0.5%
30     25              1.425 · 10^5      7.353 nats               2.3%
50     48              1.225 · 10^3      4.974 nats               7.6 · 10^-10
50     46              2.303 · 10^5      7.593 nats               9.5 · 10^-8
50     44              1.589 · 10^7      9.710 nats               5.2 · 10^-6
50     42              5.369 · 10^8      11.470 nats              1.5 · 10^-4
50     40              1.027 · 10^10     12.945 nats              2.5 · 10^-3
100    95              7.529 · 10^7      10.487 nats              7.9 · 10^-20
100    90              1.731 · 10^13     16.660 nats              1.2 · 10^-14
100    85              2.533 · 10^17     21.456 nats              1.3 · 10^-10
100    80              5.360 · 10^20     25.284 nats              2.4 · 10^-7
100    75              2.425 · 10^23     28.342 nats              9.6 · 10^-5
100    70              2.937 · 10^25     30.740 nats              1.1%
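The quantities behind one row of Table I can be sketched in code. The functions below compute $\lambda = \binom{n}{k}$, the coefficient $b_1 = 2^{-n}(n+1)\binom{n}{k}^2$ with the neighborhood that includes the vertex itself, the coefficient $b_2$ obtained by summing $E(X_\alpha X_\beta) = 2^{2-2n}\binom{n-1}{k}\binom{n-1}{k-1}$ over the $n\,2^n$ ordered pairs of adjacent vertices, and then the error bound $\eta \log(M-1) + h(\eta)$ with $\eta = (b_1 + b_2)\frac{1 - e^{-\lambda}}{\lambda}$ (since $b_3 = 0$). As assumptions of this sketch, the tail term $\mu$ of Theorem 5 is omitted, $M$ is taken as $\max\{2^n + 2, \lceil 1/(1-\eta) \rceil\}$, and the function names are hypothetical.

```python
import math


def binary_entropy(x):
    """Binary entropy function h(x) in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def hypercube_coefficients(n, k):
    """Return (lam, b1, b2) for the random orientation of the n-cube; b3 = 0."""
    lam = math.comb(n, k)
    p = lam / 2 ** n
    b1 = 2 ** n * (n + 1) * p * p
    if k >= 1:
        p_pair = 2.0 ** (2 - 2 * n) * math.comb(n - 1, k) * math.comb(n - 1, k - 1)
    else:
        p_pair = 0.0  # two adjacent vertices cannot both have zero outgoing edges
    b2 = 2 ** n * n * p_pair
    return lam, b1, b2


def entropy_error_bound(n, k):
    """eta * log(M - 1) + h(eta); the tail term mu is omitted in this sketch."""
    lam, b1, b2 = hypercube_coefficients(n, k)
    eta = (b1 + b2) * (1.0 - math.exp(-lam)) / lam
    big_m = max(2 ** n + 2, math.ceil(1.0 / (1.0 - eta)))
    return eta * math.log(big_m - 1) + binary_entropy(eta)
```

For $n = 30$ and $k = 27$ the bound is about $4.8 \cdot 10^{-3}$ nats, which relative to $H(Z) \approx 5.573$ nats is consistent with the 0.1% entry in the first row of the table.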



Example 3 (maxima of dependent Gaussian random variables): Consider a finite sequence of possibly dependent
Gaussian random variables. The Chen-Stein method was used in [3, Section 4.4] and [26] to derive explicit
upper bounds on the total variation distance between the distribution of the number of times ($W$) where this
sequence exceeds a given level and the Poisson distribution with the same mean. The following example relies on
the analysis in [3, Section 4.4], and it aims to provide a rigorous estimate of the entropy of the random variable that
counts the number of times that the sequence of Gaussian random variables exceeds a given level. This estimation
is done as an application of Theorem 5. In order to sharpen the error bound on the entropy, we derive a tightened
upper bound on the coefficient $b_2$ in (7) for the studied example; this bound on $b_2$ improves the upper bound in [3,
Eq. (21)], and it therefore also improves the error bound on the entropy of $W$. Note that the random variable $W$
can be expressed as a sum of dependent Bernoulli random variables where each of these binary random variables
is an indicator function that the corresponding Gaussian random variable in the sequence exceeds the fixed level.
The probability that a Gaussian random variable with zero mean and a unit variance exceeds a certain high level is
small, and the law of small numbers indicates that the Poisson approximation for $W$ is good if the required level
of crossings is high.

By referring to the setting in [3, Section 4.4], let $\{Z_i\}$ be a sequence of independent and standard Gaussian
random variables (having a zero mean and a unit variance). Consider a 1-dependent moving average of Gaussian
random variables $\{Y_i\}$ that are defined, for some $\theta \in \mathbb{R}$, by

\[ Y_i \triangleq \frac{Z_i + \theta Z_{i+1}}{\sqrt{1 + \theta^2}}, \qquad \forall \, i \geq 1. \tag{31} \]

This implies that $E(Y_i) = 0$, $E(Y_i^2) = 1$, and the lag-1 auto-correlation is equal to

\[ \rho \triangleq E(Y_i Y_{i+1}) = \frac{\theta}{1 + \theta^2}. \tag{32} \]

Let t > be a fixed level, n G N, and be the number of elements in the sequence {YJ., . . . , Y„} that exceed 
the level t. Then, W = X^=i^ * s tne sum °f dependent Bernoulli random variables where X{ = Isy^n f° r 
i G {1, ... ,n} (note that W is a sum of independent Bernoulli random variables only if 9 = 0). The expected 
value of W is 

E(W) = nF(Y 1 > t) = n (l - $(t)) = X n (t) (33) 

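where

$$\Phi(t) \triangleq \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2} \, dx, \quad \forall\, t \in \mathbb{R} \qquad (34)$$

is the Gaussian cumulative distribution function. Considering the sequence of Bernoulli random variables $\{X_\alpha\}_{\alpha \in I}$
where $I = \{1, \ldots, n\}$, it follows that

$$p_\alpha = \mathbb{P}(Y_\alpha > t) = 1 - \Phi(t), \quad \forall\, \alpha \in I. \qquad (35)$$

The neighborhood of dependence of an arbitrary $\alpha \in I$ is

$$B_\alpha = \{\alpha - 1, \alpha, \alpha + 1\} \cap I$$

since $Y_\alpha$ only depends on $Y_{\alpha-1}, Y_\alpha, Y_{\alpha+1}$. From (6), (33) and (35), and also because $|B_\alpha| \leq 3$ for every $\alpha \in I$,
the following upper bound on $b_1$ (see (6)) holds (see [3, Eq. (21)]):

$$b_1 \leq |I| \max_{\alpha \in I} \bigl\{ |B_\alpha| \, p_\alpha^2 \bigr\} = \frac{3 \lambda_n^2(t)}{n}. \qquad (36)$$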

In the following, a tightened upper bound on $b_2$ (as defined in (7)) is derived, which improves the bound in [3,
Eq. (21)]. Since, by definition, $X_\alpha = 1_{\{Y_\alpha > t\}}$, $X_\beta = 1_{\{Y_\beta > t\}}$, and (from (7)) $p_{\alpha,\beta} = \mathbb{E}(X_\alpha X_\beta)$, then

$$p_{\alpha,\beta} = \mathbb{P}\bigl(\min\{Y_\alpha, Y_\beta\} > t\bigr), \quad \forall\, \alpha \in I, \; \beta \in B_\alpha \setminus \{\alpha\}. \qquad (37)$$

Note that for every $\alpha \in I$ and $\beta \in B_\alpha \setminus \{\alpha\}$, necessarily $\beta = \alpha \pm 1$, so $Y_\alpha$ and $Y_\beta$ are jointly standard Gaussian
random variables with the correlation $\rho$ in (32) (it therefore follows that $\rho \in [-\frac{1}{2}, \frac{1}{2}]$, achieving these two extreme
values at $\theta = \pm 1$). From [3, Eq. (23) in Lemma 1], it follows that

$$p_{\alpha,\beta} \leq \sqrt{\frac{2(1+\rho)}{\pi(1-\rho)}} \; \bigl[ \varphi(u) - u \bigl(1 - \Phi(u)\bigr) \bigr] \qquad (38)$$

where

$$u \triangleq t \sqrt{\frac{2}{1+\rho}} \qquad (39)$$

$$\varphi(u) \triangleq \frac{1}{\sqrt{2\pi}} \exp\Bigl(-\frac{u^2}{2}\Bigr) = \Phi'(u). \qquad (40)$$

Finally, since $|I| = n$ and $|B_\alpha| \leq 3$ for every $\alpha \in I$, (7) and (38) lead to the following upper bound:

$$b_2 \leq 2n \sqrt{\frac{2(1+\rho)}{\pi(1-\rho)}} \; \bigl[ \varphi(u) - u \bigl(1 - \Phi(u)\bigr) \bigr] \qquad (41)$$

where $\rho$, $\Phi$, $u$ and $\varphi$ are introduced, respectively, in Eqs. (32), (34), (39) and (40). This improves the upper bound
on $b_2$ in [3, Eq. (21)], where the reason for this improvement is related to the weakening of an inequality in the
transition from [3, Eq. (23)] to [3, Eq. (24)]. As is noted in [3, Eq. (21)], since $Y_\alpha$ is independent of $(Y_\beta)_{\beta \notin B_\alpha}$,
it follows from (8) that $b_3 = 0$.

Having upper bounds on $b_1$ and $b_2$ (see (36) and (41)) and the exact value of $b_3$, we are ready to use Theorem 5
to get error bounds for the approximation of $H(W)$ by the entropy of a Poisson random variable $Z$ with the same
mean (i.e., $Z \sim \mathrm{Po}(\lambda_n(t))$ where $\lambda_n(t)$ is introduced in (33)). Table I presents numerical results for the Poisson
approximation of the entropy, and the associated error bounds. It also shows the improvement in the error bound
due to the tightening of the upper bound on $b_2$ in (41) (as compared to its original bound in [3, Eq. (21)]).
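The coefficients of this example are easy to evaluate numerically. The following is a minimal sketch, not the author's code: it assumes that (36) reads $b_1 \leq 3\lambda_n^2(t)/n$ and that (41) reads $b_2 \leq 2n \sqrt{2(1+\rho)/(\pi(1-\rho))}\,[\varphi(u) - u(1-\Phi(u))]$ with $u = t\sqrt{2/(1+\rho)}$. The function names are hypothetical. Under these assumptions the sketch reproduces the values quoted in this example for $n = 10^8$ and $t = 5$.

```python
import math

def gaussian_tail(u):
    """Return 1 - Phi(u), where Phi is the standard Gaussian CDF in (34)."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def gaussian_pdf(u):
    """Standard Gaussian density, as in (40)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def chen_stein_coefficients(n, theta, t):
    """Return (lambda_n(t), bound on b1, bound on b2) for the moving-average example.

    The closed forms used for the two bounds are the assumed readings of (36) and (41).
    """
    rho = theta / (1.0 + theta ** 2)              # lag-1 auto-correlation, (32)
    lam = n * gaussian_tail(t)                    # E(W), (33)
    b1_bound = 3.0 * lam ** 2 / n                 # (36)
    u = t * math.sqrt(2.0 / (1.0 + rho))          # (39)
    prefactor = math.sqrt(2.0 * (1.0 + rho) / (math.pi * (1.0 - rho)))
    b2_bound = 2.0 * n * prefactor * (gaussian_pdf(u) - u * gaussian_tail(u))  # (41)
    return lam, b1_bound, b2_bound

for theta in (1.0, -1.0):
    lam, b1_bound, b2_bound = chen_stein_coefficients(n=10 ** 8, theta=theta, t=5.0)
    print(f"theta = {theta:+.0f}: lambda = {lam:.3g}, b1 <= {b1_bound:.3g}, b2 <= {b2_bound:.3g}")
```

Since $b_3 = 0$ here, the resulting bound on the total variation distance from Theorem 2 only involves the sum of the two printed bounds.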

TABLE I

Numerical results for the Poisson approximations of the entropy $H(W)$ in Example 3. It is approximated by $H(Z)$
where $Z \sim \mathrm{Po}(\lambda_n(t))$ in (33), and the associated error bounds are computed from Theorem 5. The influence of the
tightened bound in (41) is examined by a comparison with the loosened upper bound on $b_2$ in [3, Eq. (21)].

n        theta         t (a fixed   E(W) = lambda_n(t)   Poisson approximation   Maximal relative error with
         (Eq. (31))    level)       (Eq. (33))           of H(W)                 tightened and loosened bounds
------------------------------------------------------------------------------------------------------------
10^4     +1            5            2.87 · 10^-3         0.020 nats              1.9% (2.3%)
10^6     +1            5            0.287                0.672 nats              4.9% (6.0%)
10^8     +1            5            28.7                 3.094 nats              4.9% (6.0%)
10^10    +1            5            2.87 · 10^3          5.399 nats              3.3% (4.1%)
10^12    +1            5            2.87 · 10^5          7.702 nats              2.7% (3.3%)
10^4     -1            5            2.87 · 10^-3         0.020 nats              3.8 · 10^-6
10^6     -1            5            0.287                0.672 nats              9.6 · 10^-6
10^8     -1            5            28.7                 3.094 nats              9.3 · 10^-6
10^10    -1            5            2.87 · 10^3          5.399 nats              6.1 · 10^-6
10^12    -1            5            2.87 · 10^5          7.702 nats              4.8 · 10^-6
10^4     +1            6            9.87 · 10^-6         1.24 · 10^-4 nats       0.2% (0.2%)
10^6     +1            6            9.87 · 10^-4         0.008 nats              0.3% (0.4%)
10^8     +1            6            9.87 · 10^-2         0.327 nats              0.7% (0.8%)
10^10    +1            6            9.87                 2.555 nats              1.0% (1.2%)
10^12    +1            6            9.87 · 10^2          4.866 nats              0.6% (0.7%)
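The column with the Poisson approximation of $H(W)$ can be checked independently of the error bounds. The sketch below is only an illustration (the function name and the truncation rule are assumptions, not part of the paper): it sums $-\Pi_\lambda(k) \log \Pi_\lambda(k)$ directly over a range of $k$ that covers essentially all of the probability mass.

```python
import math

def poisson_entropy(lam, tail_sigmas=12.0):
    """Entropy (in nats) of Po(lam), computed by direct summation of -p*log(p)."""
    if lam <= 0.0:
        return 0.0
    # Truncation point: the mass beyond it is negligible for the values used here.
    k_max = int(lam + tail_sigmas * math.sqrt(lam) + 30)
    entropy = 0.0
    for k in range(k_max + 1):
        log_p = -lam + k * math.log(lam) - math.lgamma(k + 1)  # log of the Poisson pmf
        entropy -= math.exp(log_p) * log_p
    return entropy

for lam in (0.287, 28.7, 9.87):
    print(f"lambda = {lam}: H(Po(lambda)) = {poisson_entropy(lam):.3f} nats")
```

The means passed in are the rounded values of $\lambda_n(t)$ from the table, so agreement with the tabulated entropies is only expected up to the rounding of $\lambda_n(t)$.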



Table I supports the following observations, which are first listed and then explained:

• For fixed values of $n$ and $\theta$, the Poisson approximation is improved by increasing the level $t$.

• For fixed values of $n$ and $t$, the error bounds for the Poisson approximation of the entropy improve when the
value of $\theta$ is modified in a way that decreases the lag-1 auto-correlation $\rho$ in (32).

• For fixed values of $n$ and $t$, the effect of the tightened upper bound on $b_2$ (see (41)) on the error bound of the
entropy $H(W)$ is more pronounced when $\rho$ is increased (via a change in the value of $\theta$).

• For fixed values of $\theta$ and $t$, the error bounds for the Poisson approximation are weakly dependent on $n$.

The explanation of these observations is, respectively, as follows:

• For fixed values of $n$ and $\theta$, by increasing the value of the positive level $t$, the probability that a standard
Gaussian random variable $Y_i$ (for $i \in \{1, \ldots, n\}$) exceeds the value $t$ is decreased. The law of small numbers
indicates that the accuracy of the Poisson approximation for $W$ is enhanced in this case.

• For fixed values of $n$ and $t$, the expected value of $W$ (i.e., $\lambda_n(t)$ in (33)) is kept fixed, and so is the upper
bound on $b_1$ in (36). However, if the correlation $\rho$ in (32) is decreased (by a proper change in the value of $\theta$)
then the value of $u$ in (39) is increased, and the upper bound on $b_2$ (see (41)) is decreased. Since the upper
bounds on $b_1$ and $b_3$ are not affected by a change in the value of $\theta$ and the upper bound on $b_2$ is decreased,
the upper bound on the total variation distance in Theorem 2 is decreased as well. This also decreases the
error bound that refers to the Poisson approximation of the entropy in Theorem 5. Note that Table I compares
the situation for $\theta = \pm 1$, which corresponds respectively to $\rho = \pm \frac{1}{2}$ (these are the two extreme values of $\rho$).

• When $n$ and $t$ are fixed, the balance between the upper bounds on $b_1$ and $b_2$ changes significantly while
changing the value of $\theta$. To exemplify this numerically, let $n = 10^8$ and $t = 5$ be the length of the sequence
of Gaussian random variables and the considered level, respectively. If $\theta = 1$, the upper bounds on $b_1$ and
$b_2$ in (36) and (41) are, respectively, equal to $2.47 \cdot 10^{-5}$ and $0.176$ (the loosened bound on $b_2$ is equal to
$0.218$). In this case, $b_2$ dominates $b_1$, and therefore an improvement in the value of $b_2$ (or its upper bound)
also improves the error bound for the Poisson approximation of the entropy $H(W)$ in Theorem 5. Consider
now the case where $\theta = -1$ (while $n$ and $t$ are kept fixed); this changes the lag-1 auto-correlation $\rho$ in (32) from
its maximal value ($+\frac{1}{2}$) to its minimal value ($-\frac{1}{2}$). In this case, the upper bound on $b_1$ does not change, but
the new bound on $b_2$ is decreased from $0.176$ to $6.88 \cdot 10^{-17}$ (and the loosened bound on $b_2$, for $\theta = -1$,
is equal to $9.90 \cdot 10^{-16}$). In the latter case, the balance between the coefficients $b_1$ and $b_2$
is reversed, i.e., the bound on $b_1$ dominates the bound on $b_2$. Hence, the upper bound on the total variation
distance and the error bound that follows from the Poisson approximation of the entropy $H(W)$ are reduced
considerably when $\theta$ changes from $+1$ to $-1$. This is because, from Theorem 2, the upper bound on the total
variation distance depends linearly on the sum $b_1 + b_2$ when $b_3 = 0$. A similar conclusion also holds w.r.t.
the error bound on the entropy (see Theorem 5). In light of this comparison, the tightened bound on $b_2$ affects
the error bound for the Poisson approximation of $H(W)$ when $\theta = 1$, in contrast to the case when $\theta = -1$.

• The numerical results in Table I show that the accuracy of the Poisson approximation is weakly dependent on
the length $n$ of the sequence $\{Y_i\}_{i=1}^n$. This is attributed to the fact that the probabilities $p_i$, for $i \in \{1, \ldots, n\}$,
are not affected by $n$; they are only affected by the choice of the level $t$. Hence, the law of small numbers does
not necessarily indicate an enhanced accuracy of the Poisson approximation for $H(W)$ when the length $n$ of
the sequence is increased.



E. Proofs of the New Bounds in Section II-C

1) Proof of Theorem 5: The random variable $W = \sum_{\alpha \in I} X_\alpha$ is a sum of Bernoulli random variables where
$|I| = n < \infty$, so $W$ takes values in the set $\{0, 1, \ldots, n\}$, and $Z$ takes non-negative integer values. Theorem 4
therefore implies that

$$|H(W) - H(Z)| \leq \eta \log(M - 1) + h(\eta) + \mu \qquad (42)$$

and we need in the following to calculate proper constants $\eta$, $\mu$ and $M$ for the Poisson approximation. The cardinality
of the set of possible values of $W$ is $m = n + 1$, so it follows from (14) that $M$ is given by (22). The parameter
$\eta$, which serves as an upper bound on the total variation distance $d_{\mathrm{TV}}(W, Z)$, is given in (21) due to the result in
Theorem 2. The last thing that is now required is the calculation of $\mu$. Let

$$\Pi_\lambda(k) \triangleq \frac{e^{-\lambda} \lambda^k}{k!}, \quad \forall\, k \in \{0, 1, \ldots\}$$

designate the probability distribution of $Z \sim \mathrm{Po}(\lambda)$, so $\mu$ is an upper bound on $\sum_{k=M}^{\infty} \{-\Pi_\lambda(k) \log \Pi_\lambda(k)\}$, which
is an infinite sum that only depends on the Poisson distribution. Straightforward calculation gives that

$$\sum_{k=M}^{\infty} \bigl\{ -\Pi_\lambda(k) \log \Pi_\lambda(k) \bigr\}
= -\lambda \log \lambda \sum_{k=M-1}^{\infty} \Pi_\lambda(k) + \lambda \sum_{k=M}^{\infty} \Pi_\lambda(k) + \sum_{k=M}^{\infty} \Pi_\lambda(k) \log(k!). \qquad (43)$$

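From Stirling's formula, for every $k \in \mathbb{N}$, the equality $k! = \sqrt{2\pi k} \left(\frac{k}{e}\right)^k e^{\eta_k}$ holds for some $\eta_k \in \left(\frac{1}{12k+1}, \frac{1}{12k}\right)$.
This therefore implies that the third infinite sum on the right-hand side of (43) satisfies

$$\begin{aligned}
\sum_{k=M}^{\infty} \Pi_\lambda(k) \log(k!)
&\leq \sum_{k=M}^{\infty} \Pi_\lambda(k) \log\left( \sqrt{2\pi k} \left(\frac{k}{e}\right)^k e^{\frac{1}{12k}} \right) \\
&= \frac{\log(2\pi)}{2} \sum_{k=M}^{\infty} \Pi_\lambda(k) + \sum_{k=M}^{\infty} \Pi_\lambda(k) \left[ \left(k + \frac{1}{2}\right) \log(k) - k \right] + \frac{1}{12} \sum_{k=M}^{\infty} \frac{\Pi_\lambda(k)}{k} \\
&\leq \frac{\log(2\pi)}{2} \sum_{k=M}^{\infty} \Pi_\lambda(k) + \sum_{k=M}^{\infty} k(k-1) \, \Pi_\lambda(k) + \frac{1}{12} \sum_{k=M}^{\infty} \Pi_\lambda(k) \\
&\overset{(a)}{=} \frac{\log(2\pi)}{2} \sum_{k=M}^{\infty} \Pi_\lambda(k) + \lambda^2 \sum_{k=M-2}^{\infty} \Pi_\lambda(k) + \frac{1}{12} \sum_{k=M}^{\infty} \Pi_\lambda(k) \\
&\leq \left( \lambda^2 + \frac{6 \log(2\pi) + 1}{12} \right) \sum_{k=M-2}^{\infty} \Pi_\lambda(k)
\end{aligned} \qquad (44)$$

where the equality in (a) follows from the identity $k(k-1) \, \Pi_\lambda(k) = \lambda^2 \, \Pi_\lambda(k-2)$ for every $k \geq 2$. By combining
(43) and (44), it follows that

$$\begin{aligned}
\sum_{k=M}^{\infty} \bigl\{ -\Pi_\lambda(k) \log \Pi_\lambda(k) \bigr\}
&\leq \left( \lambda \log \frac{e}{\lambda} \right) \sum_{k=M-1}^{\infty} \Pi_\lambda(k) + \left( \lambda^2 + \frac{6 \log(2\pi) + 1}{12} \right) \sum_{k=M-2}^{\infty} \Pi_\lambda(k) \\
&\leq \left[ \left( \lambda \log \frac{e}{\lambda} \right)_{+} + \lambda^2 + \frac{6 \log(2\pi) + 1}{12} \right] \sum_{k=M-2}^{\infty} \Pi_\lambda(k).
\end{aligned} \qquad (45)$$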



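Based on Chernoff's bound, since $Z \sim \mathrm{Po}(\lambda)$,

$$\begin{aligned}
\sum_{k=M-2}^{\infty} \Pi_\lambda(k) &= \mathbb{P}(Z \geq M - 2) \\
&\leq \inf_{\theta \geq 0} \bigl\{ e^{-\theta (M-2)} \, \mathbb{E}\bigl[ e^{\theta Z} \bigr] \bigr\} \\
&= \inf_{\theta \geq 0} \bigl\{ e^{-\theta (M-2)} \, e^{\lambda (e^{\theta} - 1)} \bigr\} \\
&= \exp\left\{ -\left[ \lambda + (M - 2) \log\left( \frac{M-2}{\lambda e} \right) \right] \right\}
\end{aligned} \qquad (46)$$

where the last equality follows by substituting the optimized value $\theta = \log\left(\frac{M-2}{\lambda}\right)$ in the exponent (note that
$\lambda \leq n = m - 1 \leq M - 2$, so the optimized value of $\theta$ is indeed non-negative). Hence, by combining (45) and (46), it
follows that

$$\sum_{k=M}^{\infty} \bigl\{ -\Pi_\lambda(k) \log \Pi_\lambda(k) \bigr\} \leq \mu \qquad (47)$$

where the parameter $\mu$ is introduced in (23). This completes the proof of Theorem 5.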
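The quantities in this proof are straightforward to evaluate. The following sketch is given under stated assumptions: (22) and (23) are not reproduced in this part of the paper, so the code assumes $M = \max\{n+2, 1/(1-\eta)\}$ and $\mu = \bigl[(\lambda \log(e/\lambda))_+ + \lambda^2 + (6\log(2\pi)+1)/12\bigr]$ times the right-hand side of (46), mirroring the forms in (56) and (57). All function names are hypothetical.

```python
import math

def binary_entropy(x):
    """Binary entropy h(x) in nats, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)

def chernoff_tail_bound(lam, m):
    """Right-hand side of (46) with m = M - 2: bounds P(Z >= m) for Z ~ Po(lam), m >= lam > 0."""
    return math.exp(-(lam + m * math.log(m / (lam * math.e))))

def poisson_tail(lam, m, extra_terms=500):
    """P(Z >= m) for Z ~ Po(lam), by direct summation (truncated far in the tail)."""
    return sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
               for k in range(m, m + extra_terms))

def entropy_error_bound(lam, eta, n):
    """Right-hand side of (42); requires eta < 1. The forms of M and mu are assumptions."""
    M = max(n + 2, 1.0 / (1.0 - eta))                   # assumed form of (22)
    prefactor = (max(lam * math.log(math.e / lam), 0.0)
                 + lam ** 2 + (6.0 * math.log(2.0 * math.pi) + 1.0) / 12.0)
    mu = prefactor * chernoff_tail_bound(lam, M - 2)    # assumed form of (23)
    return eta * math.log(M - 1) + binary_entropy(eta) + mu

# Gaussian level-crossing example with n = 10^8, t = 5, theta = 1 (so b3 = 0).
lam = 28.665
eta = (2.47e-5 + 0.176) * (1.0 - math.exp(-lam)) / lam  # (b1 + b2)(1 - e^{-lam}) / lam
bound = entropy_error_bound(lam, eta, n=10 ** 8)
print(f"error bound = {bound:.4f} nats, relative to H(Z) = 3.094 nats: {bound / 3.094:.1%}")
```

For large $n$ the term $\mu$ is numerically negligible, so the bound is dominated by $\eta \log(M-1) + h(\eta)$.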

2) Proof of Corollary 1: For proving the right-hand side of (25), which holds under the assumption that the
Bernoulli random variables $\{X_\alpha\}_{\alpha \in I}$ are independent, one chooses (similarly to Remark 3) the set $B_\alpha = \{\alpha\}$ as
the neighborhood of dependence for every $\alpha \in I$. Note that this choice of $B_\alpha$ is taken because $\sigma\bigl((X_\beta)_{\beta \in I \setminus \{\alpha\}}\bigr)$ is
independent of $X_\alpha$. From (6)-(8), this choice gives that $b_1 = \sum_{\alpha \in I} p_\alpha^2$ and $b_2 = b_3 = 0$, which therefore implies
the right-hand side of (25) as a special case of Theorem 5. Furthermore, due to the maximum entropy result of the
Poisson distribution (see Theorem 3), $H(Z) - H(W) \geq 0$. This completes the proof of Corollary 1.

3) Proof of Proposition 1: Under the assumption that the Bernoulli random variables $\{X_\alpha\}_{\alpha \in I}$ are independent,
we rely here on two possible upper bounds on the total variation distance between the distributions of $W$ and
$Z \sim \mathrm{Po}(\lambda)$. The first bound is the one in [4, Theorem 1], used earlier in Corollary 1. This bound gets the form

$$d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \leq (1 - e^{-\lambda}) \, \theta \qquad (48)$$

where $\theta$ is introduced in (29). The second bound appears in [8, Eq. (30)], and it improves the bound in [40, Eq. (10)]
(see also Pffl Eq. (4)]). This bound gets the form

$$d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \leq \frac{3\theta}{4e \, (1 - \sqrt{\theta})^{3/2}}. \qquad (49)$$

It therefore follows that

$$d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \leq \eta \qquad (50)$$

where $\eta$ is defined in (27) to be the minimum of the upper bounds on the total variation distance in (48) and (49).
The continuation of the proof of this proposition is similar to the proof of Corollary 1.



F. Generalization: Bounds on the Entropy for a Sum of Non-Negative, Integer-Valued and Bounded Random
Variables

We introduce in the following a generalization of the bounds in Section II-C to consider the accuracy of the
Poisson approximation for the entropy of a sum of non-negative, integer-valued and bounded random variables. The
generalized version of Theorem 5 is first introduced, and it is then justified by combining the proof of this theorem
for sums of Bernoulli random variables with the approach of Serfling in [45, Section 7]. This approach makes it possible to
derive an explicit upper bound on the total variation distance between a sum of non-negative and integer-valued
random variables and a Poisson distribution with the same mean. The requirement that the summands are bounded
random variables is used to obtain an upper bound on the accuracy of the Poisson approximation for the entropy of a
sum of non-negative, integer-valued and bounded random variables. The following proposition forms a generalized
version of Theorem 5.

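Proposition 2: Let $I$ be an arbitrary finite index set, and let $|I| = n$. Let $\{X_\alpha\}_{\alpha \in I}$ be non-negative, integer-valued
random variables, and assume that there exists some $A \in \mathbb{N}$ such that $X_\alpha \in \{0, 1, \ldots, A\}$ a.s. for every
$\alpha \in I$. Let

$$W \triangleq \sum_{\alpha \in I} X_\alpha, \quad p_\alpha \triangleq \mathbb{P}(X_\alpha = 1), \quad q_\alpha \triangleq \mathbb{P}(X_\alpha \geq 2), \quad \lambda \triangleq \sum_{\alpha \in I} p_\alpha, \quad q \triangleq \sum_{\alpha \in I} q_\alpha \qquad (51)$$

where $\lambda > 0$ and $q \geq 0$. Furthermore, for every $\alpha \in I$, let $X'_\alpha$ be a Bernoulli random variable that is equal to 1 if
$X_\alpha = 1$, and let it be equal to zero otherwise. Referring to these Bernoulli random variables, let

$$b'_1 \triangleq \sum_{\alpha \in I} \sum_{\beta \in B_\alpha} p_\alpha p_\beta \qquad (52)$$

$$b'_2 \triangleq \sum_{\alpha \in I} \sum_{\alpha \neq \beta \in B_\alpha} p'_{\alpha,\beta}, \quad p'_{\alpha,\beta} \triangleq \mathbb{E}(X'_\alpha X'_\beta) \qquad (53)$$

$$b'_3 \triangleq \sum_{\alpha \in I} \mathbb{E} \Bigl| \mathbb{E}\bigl( X'_\alpha - p_\alpha \,\big|\, \sigma(\{X_\beta\})_{\beta \in I \setminus B_\alpha} \bigr) \Bigr| \qquad (54)$$

where, for every $\alpha \in I$, the subset $B_\alpha \subseteq I$ is determined arbitrarily such that it includes the element $\alpha$. Furthermore,
let

$$\eta_A \triangleq (b'_1 + b'_2) \left( \frac{1 - e^{-\lambda}}{\lambda} \right) + b'_3 \min\left\{ 1, \frac{1.4}{\sqrt{\lambda}} \right\} + q \qquad (55)$$

$$M_A \triangleq \max\left\{ nA + 2, \; \frac{1}{1 - \eta_A} \right\} \qquad (56)$$

$$\mu_A \triangleq \left[ \left( \lambda \log \frac{e}{\lambda} \right)_{+} + \lambda^2 + \frac{6 \log(2\pi) + 1}{12} \right] \exp\left\{ -\left[ \lambda + (M_A - 2) \log\left( \frac{M_A - 2}{\lambda e} \right) \right] \right\} \qquad (57)$$

provided that $\eta_A < 1$. Then, the difference between the entropies (to base $e$) of $W$ and $Z \sim \mathrm{Po}(\lambda)$ satisfies

$$|H(Z) - H(W)| \leq \eta_A \log(M_A - 1) + h(\eta_A) + \mu_A. \qquad (58)$$

Proof: Following the approach in [45, Section 7], let $X'_\alpha = 1_{\{X_\alpha = 1\}}$ be a Bernoulli random variable that is
equal to the indicator function of the event $X_\alpha = 1$, so $\mathbb{P}(X'_\alpha = 1) = p_\alpha$ for every $\alpha \in I$. Let $W' = \sum_{\alpha \in I} X'_\alpha$
be the sum of the induced Bernoulli random variables. From the Chen-Stein method (see Theorem 2)

$$d_{\mathrm{TV}}\bigl(P_{W'}, \mathrm{Po}(\lambda)\bigr) \leq (b'_1 + b'_2) \left( \frac{1 - e^{-\lambda}}{\lambda} \right) + b'_3 \min\left\{ 1, \frac{1.4}{\sqrt{\lambda}} \right\} \qquad (59)$$

with the constants $b'_1$, $b'_2$ and $b'_3$ as defined in (52)-(54). Furthermore, from [45, Eq. (7.2)], it follows that

$$\begin{aligned}
d_{\mathrm{TV}}(P_W, P_{W'}) &\leq \mathbb{P}(W' \neq W) \\
&\leq \sum_{\alpha \in I} \mathbb{P}(X'_\alpha \neq X_\alpha) \\
&= \sum_{\alpha \in I} \mathbb{P}(X_\alpha \geq 2) \\
&= q.
\end{aligned} \qquad (60)$$

It therefore follows from (55), (59) and (60) that

$$d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \leq d_{\mathrm{TV}}(P_W, P_{W'}) + d_{\mathrm{TV}}\bigl(P_{W'}, \mathrm{Po}(\lambda)\bigr) \leq \eta_A.$$

The rest of this proof follows closely the proof of Theorem 5 (note that $P_W(k) = 0$ for $k > nA$, so $W$ gets
$m = nA + 1$ possible values). This completes the proof of Proposition 2. ■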



III. Improved Lower Bounds on the Total Variation Distance, Relative Entropy and Some 
Related Quantities for Sums of Independent Bernoulli Random Variables 

This section forms the second part of this work. As in the previous section, the presentation starts in Section III-A
with a brief review of some reported results that are relevant to the analysis in this section. Improved lower bounds
on the total variation distance between the distribution of the sum of independent Bernoulli random variables and
the Poisson distribution with the same mean are introduced in Section III-B. These improvements are obtained
via the Chen-Stein method, by a non-trivial refinement of the analysis that was used for the derivation of the
original lower bound by Barbour and Hall (see [4, Theorem 2]). Furthermore, the improved tightness of the new
lower bounds and their connection to the original lower bound are further considered. Section III-C introduces
an improved lower bound on the relative entropy between the above two distributions. The analysis that is used
for the derivation of the lower bound on the relative entropy is based on the lower bounds on the total variation
distance in Section III-B, combined with the use of the distribution-dependent refinement of Pinsker's inequality
by Ordentlich and Weinberger [38] (where the latter is specialized to the Poisson distribution). The lower bound on
the relative entropy is compared to some previously reported upper bounds on the relative entropy by Kontoyiannis
et al. [33] in the context of the Poisson approximation. Upper and lower bounds on the Bhattacharyya parameter,
Chernoff information and Hellinger distance between the distribution of the sum of independent Bernoulli random
variables and the Poisson distribution are next derived in Section III-D. The discussion proceeds in Section III-E
by exemplifying the use of some of the new bounds that are derived in this section in the context of classical
binary hypothesis testing. Finally, Section III-F proves the new results that are introduced in Sections III-C and
III-D. It is emphasized that, in contrast to the setting in Section II where the Bernoulli random variables may be
dependent summands, the analysis in this section depends on the assumption that the Bernoulli random variables are
independent. This difference stems from the derivation of the improved lower bound on the total variation distance
in Section III-B, which forms the starting point for the derivation of all the subsequent results that are introduced
in this section, assuming independence of the summands.

A. Review of Some Essential Results for the Analysis in Section III

The following definitions of probability metrics are particularized and simplified to the case of our interest where
the probability mass functions are defined on $\mathbb{N}_0$.

Definition 2: Let $P$ and $Q$ be two probability mass functions that are defined on the same countable set $\mathcal{X}$. The
Hellinger distance and the Bhattacharyya parameter between $P$ and $Q$ are, respectively, given by

$$d_{\mathrm{H}}(P, Q) \triangleq \left( \frac{1}{2} \sum_{x \in \mathcal{X}} \Bigl( \sqrt{P(x)} - \sqrt{Q(x)} \Bigr)^2 \right)^{1/2} \qquad (61)$$

$$\mathrm{BC}(P, Q) \triangleq \sum_{x \in \mathcal{X}} \sqrt{P(x) \, Q(x)} \qquad (62)$$

so, these two probability metrics (including the total variation distance in Definition 1) are bounded between 0
and 1.

Remark 8: In general, these probability metrics are defined in the setting where $(\mathcal{X}, d)$ is a separable metric
space. The interest in this work is in the specific case where $\mathcal{X} = \mathbb{N}_0$ and $d = |\cdot|$. In this case, the expressions of
these probability metrics are simplified as above. For further study of probability metrics and their properties, the
interested reader is referred to, e.g., [5, Appendix A.1], [13, Chapter 2] and [39, Section 3.3].

Remark 9: The Hellinger distance is related to the Bhattacharyya parameter via the equality

$$d_{\mathrm{H}}(P, Q) = \sqrt{1 - \mathrm{BC}(P, Q)}. \qquad (63)$$

Definition 3: The Chernoff information and relative entropy (a.k.a. divergence or Kullback-Leibler distance)
between two probability mass functions $P$ and $Q$ that are defined on a countable set $\mathcal{X}$ are, respectively, given by

$$C(P, Q) \triangleq -\min_{\theta \in [0,1]} \log\left( \sum_{x \in \mathcal{X}} P^{\theta}(x) \, Q^{1-\theta}(x) \right) \qquad (64)$$

$$D(P \| Q) \triangleq \sum_{x \in \mathcal{X}} P(x) \log\left( \frac{P(x)}{Q(x)} \right) \qquad (65)$$

so $C(P, Q), D(P \| Q) \in [0, \infty]$. Throughout this paper, the logarithms are to base $e$.

Proposition 3: For two probability mass functions $P$ and $Q$ that are defined on the same set $\mathcal{X}$

$$d_{\mathrm{TV}}(P, Q) \leq \sqrt{2} \, d_{\mathrm{H}}(P, Q) \leq \sqrt{D(P \| Q)}. \qquad (66)$$

The left-hand side of (66) is proved in [39, p. 99], and the right-hand side is proved in [39, p. 328].

Remark 10: It is noted that the Hellinger distance in the middle of (66) is not multiplied by the square root of 2
in [39], due to a small difference in the definition of this distance where the factor of one-half on the right-hand
side of (61) does not appear in the definition of the Hellinger distance in [39, p. 98]. However, this is just a matter
of normalization of this distance (as otherwise, according to [39], the Hellinger distance varies between 0 and $\sqrt{2}$
instead of the interval [0, 1]). The definition of this distance in (61) is consistent, e.g., with [5]. It makes the range
of this distance to be between 0 and 1, similarly to the total variation, local and Kolmogorov-Smirnov distances
and also the Bhattacharyya parameter that are considered in this paper.
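The relations (63) and (66) can be checked numerically on a concrete pair of distributions. The sketch below is only an illustration: it takes $P$ to be a binomial distribution and $Q$ the Poisson distribution with the same mean, truncates the support at a hypothetical cutoff where the Poisson mass is negligible, and assumes the normalization of the Hellinger distance with the factor of one-half discussed in Remark 10. The function names are not from the paper.

```python
import math

def binomial_pmf(n, p, k):
    """P(X = k) for X ~ Binomial(n, p); zero outside {0, ..., n}."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(lam, k):
    """P(Z = k) for Z ~ Po(lam), evaluated in the log domain."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def probability_metrics(P, Q):
    """Return (d_TV, d_H, BC, D(P||Q)) for two pmfs given as equal-length lists."""
    d_tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))
    d_h = math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(P, Q)))
    bc = sum(math.sqrt(p * q) for p, q in zip(P, Q))
    kl = sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0.0)  # natural logarithm
    return d_tv, d_h, bc, kl

n, p, cutoff = 20, 0.1, 80           # cutoff: truncation of the Poisson support (assumption)
P = [binomial_pmf(n, p, k) for k in range(cutoff)]
Q = [poisson_pmf(n * p, k) for k in range(cutoff)]
d_tv, d_h, bc, kl = probability_metrics(P, Q)
print(f"d_TV = {d_tv:.5f}, sqrt(2) d_H = {math.sqrt(2) * d_h:.5f}, sqrt(D) = {math.sqrt(kl):.5f}")
```

The three printed numbers should appear in non-decreasing order, in agreement with Proposition 3.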

The Chernoff information, $C(P, Q)$, is the best achievable exponent in the Bayesian probability of error for binary
hypothesis testing (see, e.g., [11, Theorem 11.9.1]). Furthermore, if $X_1, X_2, \ldots, X_N$ are i.i.d. random variables,
having distribution $P$ with prior probability $\pi_1$ and distribution $Q$ with prior probability $\pi_2$, the following upper
bound holds for the best achievable overall probability of error:

$$P_{\mathrm{e}}^{(N)} \leq \exp\bigl( -N \, C(P, Q) \bigr). \qquad (67)$$

A distribution-dependent refinement of Pinsker's inequality [38]: Pinsker's inequality provides a lower bound
on the relative entropy in terms of the total variation distance between two probability measures that are defined
on the same set. It states that

$$D(P \| Q) \geq 2 \bigl( d_{\mathrm{TV}}(P, Q) \bigr)^2. \qquad (68)$$

In [38], a distribution-dependent refinement of Pinsker's inequality was introduced for an arbitrary pair of probability
distributions $P$ and $Q$ that are defined on $\mathbb{N}_0$. It is of the form

$$D(P \| Q) \geq \varphi(\pi_Q) \bigl( d_{\mathrm{TV}}(P, Q) \bigr)^2 \qquad (69)$$

where

$$\pi_Q \triangleq \sup_{A \subseteq \mathbb{N}_0} \min\bigl\{ Q(A), \, 1 - Q(A) \bigr\} \qquad (70)$$

and

$$\varphi(p) \triangleq \begin{cases} \dfrac{1}{1 - 2p} \log\left( \dfrac{1-p}{p} \right) & \text{if } 0 < p < \frac{1}{2} \\[2mm] 2 & \text{if } p = \frac{1}{2} \end{cases} \qquad (71)$$

so $\varphi$ is monotonic decreasing in the interval $(0, \frac{1}{2}]$,

$$\lim_{p \to 0^{+}} \varphi(p) = +\infty, \quad \lim_{p \to \frac{1}{2}^{-}} \varphi(p) = 2$$

where the latter limit implies that $\varphi$ is left-continuous at one-half. Note that it follows from (70) that $\pi_Q \in [0, \frac{1}{2}]$.

In Section III-C, we rely on this refinement of Pinsker's inequality and combine it with the new lower bound on
the total variation distance between the distribution of a sum of independent Bernoulli random variables and the
Poisson distribution with the same mean that is introduced in Section III-B. The combination of these two bounds
provides a new lower bound on the relative entropy between these two distributions.
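To make the refinement concrete, the following sketch evaluates the function $\varphi$ and the quantity $\pi_Q$ for a distribution with a small finite support. It assumes that $\varphi(p) = \frac{1}{1-2p} \log\frac{1-p}{p}$ for $0 < p < \frac{1}{2}$ with $\varphi(\frac{1}{2}) = 2$, and it computes $\pi_Q$ by brute force over all subsets, which is only feasible for toy examples; the function names are hypothetical, and the Poisson case of Section III-C requires a different treatment.

```python
import math
from itertools import combinations

def varphi(p):
    """The function in the refined Pinsker inequality, defined on (0, 1/2]."""
    if not 0.0 < p <= 0.5:
        raise ValueError("p must lie in (0, 1/2]")
    if p == 0.5:
        return 2.0
    return math.log((1.0 - p) / p) / (1.0 - 2.0 * p)

def pi_q(q):
    """Brute-force max over subsets A of min{Q(A), 1 - Q(A)} for a short pmf list q."""
    best = 0.0
    indices = range(len(q))
    for size in range(len(q) + 1):
        for subset in combinations(indices, size):
            mass = sum(q[i] for i in subset)
            best = max(best, min(mass, 1.0 - mass))
    return best

def refined_pinsker_lower_bound(d_tv, pi):
    """Lower bound varphi(pi_Q) * d_TV^2 on the relative entropy D(P||Q)."""
    return varphi(pi) * d_tv ** 2

Q = [0.7, 0.2, 0.1]
pi = pi_q(Q)
print(f"pi_Q = {pi:.3f}, varphi(pi_Q) = {varphi(pi):.4f} (Pinsker's constant is 2)")
```

Since $\varphi \geq 2$ on $(0, \frac{1}{2}]$, the refined bound is never weaker than Pinsker's inequality, and it is strictly stronger whenever $\pi_Q < \frac{1}{2}$.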



B. Improved Lower Bounds on the Total Variation Distance 

In Theorem 1, we introduced the upper and lower bounds on the total variation distance in [4, Theorems 1 and 2]
(see also [5, Theorem 2.M and Corollary 3.D.1]). This shows that these upper and lower bounds are essentially
tight, where the lower bound is about $\frac{1}{32}$ of the upper bound. Furthermore, it was claimed in [5, Remark 3.2.2]
(with no explicit proof) that the constant $\frac{1}{32}$ in the lower bound on the left-hand side of (3) can be improved to
$\frac{1}{14}$. In this section, we obtain further improvements of this lower bound where, e.g., the ratio of the upper and
new lower bounds on the total variation distance tends to 1.69 in the limit where $\lambda \to 0$, and this ratio tends
to 10.54 in the limit where $\lambda \to \infty$. As will be demonstrated in the continuation of Section III, the effect of
these improvements is enhanced considerably when considering improved lower bounds on the relative entropy and
some other related information-theoretic measures. We further study later in this section the implications of the
improvement in lower bounding the total variation distance, originating in this sub-section, and exemplify these
improvements in the context of information theory and statistics.

Similarly to the proof of [4, Theorem 2], the derivation of the improved lower bound is also based on the
Chen-Stein method, but it follows from a significant modification of the analysis that served to derive the original
lower bound in [4, Theorem 2]. The following upper bound on the total variation distance is taken (as is) from
[4, Theorem 1] (this bound also appears in Theorem 1 here). The motivation for improving the lower bound on
the total variation distance is to take advantage of it to improve the lower bound on the relative entropy (via
Pinsker's inequality or a refinement of it) and some other related quantities, and then to examine the benefit of this
improvement in an information-theoretic context.

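Theorem 6: Let $W = \sum_{i=1}^n X_i$ be a sum of $n$ independent Bernoulli random variables with $\mathbb{E}(X_i) = p_i$ for
$i \in \{1, \ldots, n\}$, and $\mathbb{E}(W) = \lambda$. Then, the total variation distance between the probability distribution of $W$ and
the Poisson distribution with mean $\lambda$ satisfies

$$K_1(\lambda) \sum_{i=1}^n p_i^2 \leq d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \leq \left( \frac{1 - e^{-\lambda}}{\lambda} \right) \sum_{i=1}^n p_i^2 \qquad (72)$$

where $K_1$ is given by

$$K_1(\lambda) \triangleq \sup_{\substack{\alpha_1, \alpha_2 \in \mathbb{R}, \\ \alpha_2 \leq \lambda + \frac{3}{2}, \\ \theta > 0}} \left( \frac{1 - h_\lambda(\alpha_1, \alpha_2, \theta)}{2 \, g_\lambda(\alpha_1, \alpha_2, \theta)} \right) \qquad (73)$$

and

$$h_\lambda(\alpha_1, \alpha_2, \theta) \triangleq \frac{3\lambda + (2 - \alpha_2 + \lambda)^3 - (1 - \alpha_2 + \lambda)^3 + |\alpha_1 - \alpha_2| \bigl( 2\lambda + |3 - 2\alpha_2| \bigr) \exp\left( -\frac{(1 - \alpha_2)_{+}^2}{\theta \lambda} \right)}{\theta \lambda} \qquad (74)$$

$$x_{+} \triangleq \max\{x, 0\}, \quad x_{+}^2 \triangleq (x_{+})^2, \quad \forall\, x \in \mathbb{R} \qquad (75)$$

$$g_\lambda(\alpha_1, \alpha_2, \theta) \triangleq \max\left\{ \left| \left( 1 + \sqrt{\frac{2}{\theta \lambda e}} \cdot |\alpha_1 - \alpha_2| \right) \lambda + \max_i \{ x(u_i) \} \right|, \; \left| \left( 2 e^{-\frac{3}{2}} + \sqrt{\frac{2}{\theta \lambda e}} \cdot |\alpha_1 - \alpha_2| \right) \lambda - \min_i \{ x(u_i) \} \right| \right\} \qquad (76)$$

$$x(u) \triangleq (c_0 + c_1 u + c_2 u^2) \exp(-u^2), \quad \forall\, u \in \mathbb{R} \qquad (77)$$

$$\{u_i\} \triangleq \bigl\{ u \in \mathbb{R} : \; 2 c_2 u^3 + 2 c_1 u^2 - 2 (c_2 - c_0) u - c_1 = 0 \bigr\} \qquad (78)$$

$$c_0 \triangleq (\alpha_2 - \alpha_1)(\lambda - \alpha_2) \qquad (79)$$

$$c_1 \triangleq \sqrt{\theta \lambda} \, (\lambda + \alpha_1 - 2\alpha_2) \qquad (80)$$

$$c_2 \triangleq -\theta \lambda. \qquad (81)$$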


Remark 11: The upper and lower bounds on the total variation distance in (72) scale like $\sum_{i=1}^n p_i^2$, similarly to
the known bounds in Theorem 1. The ratio of the upper and lower bounds in Theorem 1 tends to 32.00 when
$\lambda$ tends either to zero or to infinity. It was obtained numerically that the ratio of the upper and lower bounds in Theorem 6
improves by a factor of 18.96 when $\lambda \to 0$, a factor of 3.04 when $\lambda \to \infty$, and at least by a factor of 2.48 for all
$\lambda \in (0, \infty)$. Alternatively, since the upper bound on the total variation distance in Theorems 1 and 6 is common,
it follows that the ratio of the upper bound and new lower bound on the total variation distance is reduced to 1.69
when $\lambda \to 0$, it is 10.54 when $\lambda \to \infty$, and it is at most 12.91 for all $\lambda \in (0, \infty)$.

Remark 12: [14, Theorem 1.2] provides an asymptotic result for the total variation distance between the distribution
of the sum $W$ of $n$ independent Bernoulli random variables with $\mathbb{E}(X_i) = p_i$ and the Poisson distribution
with mean $\lambda = \sum_{i=1}^n p_i$. It shows that when $\sum_{i=1}^n p_i \to \infty$ and $\max_{1 \leq i \leq n} p_i \to 0$ as $n \to \infty$ then

$$d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \sim \frac{1}{\sqrt{2\pi e} \, \lambda} \sum_{i=1}^n p_i^2. \qquad (82)$$

This implies that the ratio of the upper bound on the total variation distance in [4, Theorem 1] (see Theorem 1
here) and this asymptotic expression is equal to $\sqrt{2\pi e} \approx 4.133$. Therefore, in light of the previous remark (see
Remark 11), it follows that the ratio between the exact asymptotic value in (82) and the new lower bound in (72)
is equal to $\frac{10.54}{\sqrt{2\pi e}} \approx 2.55$. It therefore follows from Remark 11 that in the limit where $\lambda \to 0$, the new lower bound
on the total variation distance in (72) is smaller than the exact value by a factor of no more than 1.69, and for $\lambda \gg 1$, it is smaller
than the exact asymptotic result by a factor of 2.55.

Remark 13: Since $\{u_i\}$ in (78) are zeros of a cubic polynomial equation with real coefficients, the size
of the set $\{u_i\}$ is either 1 or 3. But since one of the values of $u_i$ is a point where the global maximum of $x$ is
attained, and another value of $u_i$ is the point where its global minimum is attained (note that $\lim_{u \to \pm\infty} x(u) = 0$
and $x$ is differentiable, so the global maxima and minima of $x$ are attained at finite values where the derivative of
$x$ is equal to zero), the size of the set $\{u_i\}$ cannot be 1, which implies that it should be equal to 3.

Remark 14: The optimization that is required for the computation of $K_1$ in (73) w.r.t. the three parameters
$\alpha_1, \alpha_2 \in \mathbb{R}$ and $\theta \in \mathbb{R}^{+}$ is performed numerically. The numerical procedure for the computation of $K_1$ will be
discussed later (after introducing the following corollary).

In the following, we introduce a closed-form lower bound on the total variation distance that is looser than the
lower bound in Theorem 6, but which already improves the lower bound in [4, Theorem 2]. This lower bound
follows from Theorem 6 by the special choice of $\alpha_1 = \alpha_2 = \lambda$ that is included in the optimization set for $K_1$ on
the right-hand side of (73). Following this sub-optimal choice, the lower bound in the next corollary follows by a
derivation of a closed-form expression for the third free parameter $\theta \in \mathbb{R}^{+}$. In fact, this was our first step towards
the derivation of an improved lower bound on the total variation distance. After introducing the following corollary,
we discuss it briefly, and suggest an optimization procedure for computing $K_1$ on the left-hand side of (72).

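Corollary 2: Under the assumptions in Theorem 6,

$$\widetilde{K}_1(\lambda) \sum_{i=1}^n p_i^2 \leq d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \leq \left( \frac{1 - e^{-\lambda}}{\lambda} \right) \sum_{i=1}^n p_i^2 \qquad (83)$$

where

$$\widetilde{K}_1(\lambda) \triangleq \frac{e}{2\lambda} \cdot \frac{1 - \frac{3\lambda + 7}{\lambda \theta}}{\theta + 2 e^{-1/2}} \qquad (84)$$

$$\theta \triangleq 3 + \frac{7}{\lambda} + \frac{1}{\lambda} \cdot \sqrt{(3\lambda + 7) \bigl[ (3 + 2 e^{-1/2}) \lambda + 7 \bigr]}. \qquad (85)$$

Furthermore, the ratio of the upper and lower bounds on the total variation distance in (83) tends to $\frac{56}{e} \approx 20.601$
as $\lambda \to 0$, it tends to 10.539 as $\lambda \to \infty$, and this ratio is monotonic decreasing as a function of $\lambda \in (0, \infty)$ (see
the upper plot in Figure 1 and the calculation of the two limits in Section III-F3).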
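The closed-form bound is simple to evaluate. The sketch below is not taken from the paper: it assumes that the coefficient of the lower bound is $\frac{e}{2\lambda} \cdot \bigl(1 - \frac{3\lambda+7}{\lambda\theta}\bigr) / (\theta + 2e^{-1/2})$ with $\theta$ as in (85), which is what the choice $\alpha_1 = \alpha_2 = \lambda$ in Theorem 6 gives, and it uses hypothetical function names. It prints the ratio of the upper and lower bounds for several values of $\lambda$, including values close to the two limits.

```python
import math

TWO_EXP_MINUS_HALF = 2.0 * math.exp(-0.5)

def theta_closed_form(lam):
    """The optimized free parameter theta in (85)."""
    root = math.sqrt((3.0 * lam + 7.0) * ((3.0 + TWO_EXP_MINUS_HALF) * lam + 7.0))
    return 3.0 + 7.0 / lam + root / lam

def lower_bound_coefficient(lam):
    """Assumed closed form of the coefficient of the lower bound in (83)."""
    theta = theta_closed_form(lam)
    numerator = 1.0 - (3.0 * lam + 7.0) / (lam * theta)
    return (math.e / (2.0 * lam)) * numerator / (theta + TWO_EXP_MINUS_HALF)

def upper_bound_coefficient(lam):
    """(1 - exp(-lam)) / lam, computed stably for small lam via expm1."""
    return -math.expm1(-lam) / lam

def bound_ratio(lam):
    """Ratio of the upper and lower bounds on the total variation distance in (83)."""
    return upper_bound_coefficient(lam) / lower_bound_coefficient(lam)

for lam in (1e-6, 1.0, 10.0, 1e6):
    print(f"lambda = {lam:g}: ratio = {bound_ratio(lam):.3f}")
```

Because the sums $\sum_i p_i^2$ cancel, the ratio depends on $\lambda$ only, so it can be tabulated once and reused for any choice of the probabilities $p_i$.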

Remark 15: The lower bound on the total variation distance on the left-hand side of (83) improves uniformly
the lower bound in [4, Theorem 2] (i.e., the left-hand side of Eq. (3) here). The improvement is by factors of 1.55
and 3.03 for $\lambda \to 0$ and $\lambda \to \infty$, respectively. Note that this improvement is already remarkable since the ratio of
the upper and lower bounds in [4, Theorems 1 and 2] (Theorem 1 here) is equal to 32 in these two extreme cases,
and it is also uniformly upper bounded by 32 for all values of $\lambda \in (0, \infty)$. Furthermore, in light of Remark 11,
the improvement of the lower bound on the total variation distance in Theorem 6 over its loosened version in
Corollary 2 is especially significant for small values of $\lambda$, but it is marginal for large values of $\lambda$; this improvement
is by a factor of 11.88 in the limit where $\lambda \to 0$, but asymptotically there is no improvement if $\lambda \to \infty$, where
this even holds for $\lambda \geq 20$ (see Figure 1, where all the curves in this plot merge approximately for $\lambda \geq 20$). Note,
however, that even if $\lambda \to \infty$, the lower bounds in Theorem 6 and Corollary 2 improve the original lower bound
in Theorem 1 by a factor that is slightly above 3.

Remark 16: In light of Corollary 2, a simplified algorithm is suggested in the following for the computation of $K_1$
in (73). In general, what we compute numerically is a lower bound on $K_1$; but this is fine since $K_1$ is the coefficient
of the lower bound on the left-hand side of (72), so its replacement by a lower bound still gives a valid lower bound
on the total variation distance. The advantage of the suggested algorithm is its reduced complexity, as compared
to a brute force search over the infinite three-dimensional region for $(\alpha_1, \alpha_2, \theta)$; the numerical computation that is
involved with this algorithm takes less than a second on a standard PC. The algorithm proceeds as follows:

• It chooses the initial values $\alpha_1 = \alpha_2 = \lambda$, and $\theta$ as is determined on the right-hand side of (85). The
corresponding lower bound on the total variation distance from Theorem 6, for this sub-optimal selection of
the three free parameters $\alpha_1, \alpha_2, \theta$, is equal to the closed-form lower bound in Corollary 2.

• At this point, the algorithm performs several iterations where at each iteration, it defines a certain three-dimensional
grid around the optimized point from the previous iteration (the zeroth iteration refers to the
initial choice of parameters from the previous item, and to the closed-form lower bound in Corollary 2). At
each iteration, the algorithm searches for the optimized point on the new grid (i.e., it computes the maximum
of the expression inside the supremum on the right-hand side of (73) among all the points of the grid), and it
also updates the new location of this point $(\alpha_1, \alpha_2, \theta)$ for the search that is made in the next iteration. Note
that, from (73), the grid should exclude points $(\alpha_1, \alpha_2, \theta)$ when either $\theta \leq 0$ or $\alpha_2 > \lambda + \frac{3}{2}$.
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Fig. 1. The figure presents curves that correspond to ratios of upper and lower bounds on the total variation distance between the sum of 
independent Bernoulli random variables and the Poisson distribution with the same mean A. The upper bound on the total variation distance 
for all these three curves is the bound by Barbour and Hall (see [4 Theorem 1] or Theorem Q] here). The lower bounds that the three curves 
refer to them are the following: the curve at the bottom (i.e., the one which provides the lowest ratio for a fixed A) is the improved lower 
bound on the total variation distance that is introduced in Theorem [6] The curve slightly above it for small values of A corresponds to looser 
lower bound when cei and Q2 in J73t are set to be equal (i.e., ai — ct2 = a is their common value), so that the optimization of K\ for 
this curve is reduced to be a two-parameter maximization of K\ over the two free parameters a G R and 9 G K + . Finally, the curve at the 
top of this figure corresponds to the further loosening of this lower bound where a is set to be equal to A; this leads to a single-parameter 
maximization of K\ (over the parameter 9 G R + ) whose optimization leads to the closed-form expression of the lower bound in Corollary [5] 
For comparison, in order to assess the enhanced tightness of the new lower bounds, note that the ratio of the upper and lower bounds on 
the total variation distance from j4] Theorems 1 and 2] (or Theorem Q] here) is roughly equal to 32 for all values of A. 



• At the beginning of this recursive procedure, the algorithm take a very large neighborhood around the point 
that was selected at the previous iteration (or the initial selection of the point from the first item). The size of 
this neighborhood at each subsequent iteration shrinks, but the grid also becomes more dense around the new 
selected point from the previous iteration. 

It is noted that numerically, the resulting lower bound on $K_1$ seems to be the exact value in (73) and not just a lower bound. In any case, the reduction in the computational complexity of (a lower bound on) $K_1$ provides a very fast algorithm. The conclusions of the last two remarks (i.e., Remarks 15 and 16) are supported by Figure 1.
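
The iterative search of Remark 16 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation. The expression inside the supremum is coded in the form that the proof in Section III-F arrives at (the function $h_\lambda$ as in the bound leading to (134), the function $g_\lambda$ as implied by (142), and the coefficients $c_0, c_1, c_2$ as they appear in (140)). The initial $\theta$ is obtained by maximizing the objective for $\alpha_1 = \alpha_2 = \lambda$ in closed form, which is assumed to coincide with (85). All function names, the grid size, the initial radius, the use of one common radius for the three coordinates and the halving schedule are illustrative choices.

```python
import math
from itertools import product


def real_cubic_roots(a, b, c, d):
    """Real roots of a*u^3 + b*u^2 + c*u + d = 0 (a != 0), by Cardano's method."""
    b, c, d = b / a, c / a, d / a
    p = c - b * b / 3.0                      # depressed cubic: t^3 + p*t + q = 0
    q = 2.0 * b ** 3 / 27.0 - b * c / 3.0 + d
    shift = -b / 3.0                         # u = t + shift
    disc = (q / 2.0) ** 2 + (p / 3.0) ** 3
    if disc > 0.0:                           # a single real root
        s = math.sqrt(disc)
        cbrt = lambda v: math.copysign(abs(v) ** (1.0 / 3.0), v)
        return [cbrt(-q / 2.0 + s) + cbrt(-q / 2.0 - s) + shift]
    if p > -1e-200:                          # numerically a triple root
        return [shift]
    r = 2.0 * math.sqrt(-p / 3.0)            # three real roots (trigonometric form)
    phi = math.acos(max(-1.0, min(1.0, 3.0 * q / (p * r))))
    return [r * math.cos((phi - 2.0 * math.pi * k) / 3.0) + shift for k in range(3)]


def objective(lam, a1, a2, th):
    """(1 - h) / (2 g), the expression maximized over (alpha_1, alpha_2, theta)."""
    if th <= 0.0 or a2 > lam + 1.5:          # outside the admissible region
        return -math.inf
    tl = th * lam
    damp = math.exp(-max(1.0 - a2, 0.0) ** 2 / tl)
    h = (3.0 * lam + (2.0 - a2 + lam) ** 3 - (1.0 - a2 + lam) ** 3
         + abs(a1 - a2) * (2.0 * lam + abs(3.0 - 2.0 * a2)) * damp) / tl
    c0 = (a2 - a1) * (lam - a2)
    c1 = math.sqrt(tl) * (a1 + lam - 2.0 * a2)
    c2 = -tl
    # Stationary points of x(u) = (c0 + c1 u + c2 u^2) exp(-u^2).
    roots = real_cubic_roots(2.0 * c2, 2.0 * c1, -2.0 * (c2 - c0), -c1)
    xs = [(c0 + c1 * u + c2 * u * u) * math.exp(-u * u) for u in roots]
    s = math.sqrt(2.0 / (tl * math.e)) * abs(a1 - a2)
    g = max(abs((1.0 + s) * lam + max(xs)),
            abs((2.0 * math.exp(-1.5) + s) * lam - min(xs)))
    return (1.0 - h) / (2.0 * g) if g > 0.0 else -math.inf


def initial_theta(lam):
    """Maximizer over theta of objective(lam, lam, lam, theta)."""
    a = 3.0 + 7.0 / lam
    return a + math.sqrt(a * (a + 2.0 * math.exp(-0.5)))


def k1_lower_bound(lam, iterations=8, points=9):
    """Grid refinement; returns (value, alpha_1, alpha_2, theta)."""
    th0 = initial_theta(lam)
    best = (objective(lam, lam, lam, th0), lam, lam, th0)
    radius = 4.0 * max(lam, 1.0)             # large first neighborhood (arbitrary)
    offsets = [2.0 * i / (points - 1) - 1.0 for i in range(points)]
    for _ in range(iterations):
        _, a1, a2, th = best
        for d1, d2, d3 in product(offsets, repeat=3):
            cand = (a1 + radius * d1, a2 + radius * d2, th + radius * d3)
            val = objective(lam, *cand)
            if val > best[0]:
                best = (val, *cand)
        radius /= 2.0                        # smaller neighborhood, denser grid
    return best


if __name__ == "__main__":
    print(k1_lower_bound(10.0))
```

Since the search starts from the closed-form point and only accepts strict improvements, the returned value is never below the closed-form coefficient of Corollary 2; whether it reaches the supremum in (73) depends on the (arbitrary) grid parameters.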



C. Improved Lower Bounds on the Relative Entropy 

The following theorem relies on the new lower bound on the total variation distance in Theorem 6 and the distribution-dependent refinement of Pinsker's inequality in [38]. Their combination serves to derive a new lower bound on the relative entropy between the distribution of a sum of independent Bernoulli random variables and a Poisson distribution with the same mean. An upper bound on this relative entropy was introduced in [33, Theorem 1]. Together with the new lower bound on the relative entropy, it leads to the following statement:

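Theorem 7: In the setting of Theorem 6, the relative entropy between the probability distribution of $W$ and the Poisson distribution with mean $\lambda = \mathbb{E}(W)$ satisfies the following inequality:

$$K_2(\lambda) \left(\sum_{i=1}^n p_i^2\right)^2 \le D\big(P_W \,\|\, \mathrm{Po}(\lambda)\big) \le \frac{1}{\lambda} \sum_{i=1}^n \frac{p_i^3}{1-p_i} \tag{86}$$

where

$$K_2(\lambda) \triangleq m(\lambda) \big(K_1(\lambda)\big)^2 \tag{87}$$

with $K_1$ from (73), and

$$m(\lambda) \triangleq \begin{cases} \dfrac{1}{2e^{-\lambda}-1} \, \log\left(\dfrac{1}{e^{\lambda}-1}\right) & \text{if } \lambda \in (0, \log 2) \\[2mm] 2 & \text{if } \lambda \ge \log 2. \end{cases} \tag{88}$$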

Remark 17: For the sake of simplicity, in order to have a bound in closed form (that is not subject to numerical optimization), the lower bound on the relative entropy on the left-hand side of (86) can be loosened by replacing $K_1(\lambda)$ on the right-hand side of (87) with $\widetilde{K}_1(\lambda)$ in (84) and (85). In light of Remark 13, this possible loosening of the lower bound on the relative entropy has no effect if $\lambda > 30$.

Remark 18: The distribution-dependent refinement of Pinsker's inequality from [38] yields that, when applied to a Poisson distribution with mean $\lambda$, the coefficient $m(\lambda)$ in (87) is larger than 2 for $\lambda \in (0, \log 2)$, and it is approximately equal to $\log\left(\frac{1}{\lambda}\right)$ for $\lambda \approx 0$. Hence, for $\lambda \approx 0$, the refinement of Pinsker's inequality in [38] leads to a remarkable improvement in the lower bound that appears in (86)-(88), which is by approximately a factor of $\frac{1}{2} \log\left(\frac{1}{\lambda}\right)$. If, however, $\lambda \ge \log 2$ then there is no refinement of Pinsker's inequality (since $m(\lambda) = 2$ in (88)).
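
The behavior of the coefficient $m(\lambda)$ described in Remark 18 can be checked numerically. The following Python lines are a small sketch that evaluates (88) and compares it with $\log(1/\lambda)$; the function name is an illustrative choice.

```python
import math


def pinsker_coefficient(lam):
    """m(lambda) in (88): coefficient of the refined Pinsker inequality for Po(lambda)."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    if lam >= math.log(2.0):
        return 2.0                            # no refinement over Pinsker's inequality
    # expm1(lam) = e^lam - 1, accurate also for small lam
    return math.log(1.0 / math.expm1(lam)) / (2.0 * math.exp(-lam) - 1.0)


if __name__ == "__main__":
    for lam in (1e-4, 1e-2, 0.1, 0.5, 1.0):
        print(lam, pinsker_coefficient(lam), math.log(1.0 / lam))
```

For small $\lambda$ the two printed columns nearly agree, and the coefficient decreases to 2 as $\lambda$ approaches $\log 2$.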

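Remark 19: The combination of the original lower bound on the total variation distance from [4, Theorem 2] (see (3)) with Pinsker's inequality (see (68)) gives the following lower bound on the relative entropy:

$$D\big(P_W \,\|\, \mathrm{Po}(\lambda)\big) \ge \frac{1}{512} \left(1 \wedge \frac{1}{\lambda^2}\right) \left(\sum_{i=1}^n p_i^2\right)^2. \tag{89}$$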

In light of Remarks [TT] and [TU it is possible to quantify the improvement that is obtained by the new lower bound 
of Theorem 7 in comparison to the looser lower bound in (89). The improvement of the new lower bound on the relative entropy is by a factor of $179.7 \log\left(\frac{1}{\lambda}\right)$ for $\lambda \approx 0$, a factor of 9.22 for $\lambda \to \infty$, and at least by a factor of 6.14 for all $\lambda \in (0, \infty)$. The conclusions in the last two remarks (i.e., Remarks 18 and 19) are supported by Figure 2, which refers to the special case of the relative entropy between the binomial and Poisson distributions.

Remark 20: In (2Ql Example 6], it is shown that if E(X) < A then D(P x \\Po(\)) > ^ (E(X) - A) 2 . Since 
$\mathbb{E}(S_n) = \lambda$, this lower bound on the relative entropy is not informative for the relative entropy $D\big(P_{S_n} \,\|\, \mathrm{Po}(\lambda)\big)$. Theorem 7 and the loosened bound in (89) are, however, informative in the studied case.

The author was notified in [21] about the existence of another recently derived lower bound on the relative entropy $D\big(P_X \,\|\, \mathrm{Po}(\lambda)\big)$ in terms of the variance of a random variable $X$ with values in $\mathbb{N}_0$ (this lower bound appears in a currently unpublished work). The two bounds were derived independently, based on different approaches. In the setting where $X = \sum_{i=1}^n X_i$ is a sum of independent Bernoulli random variables $\{X_i\}_{i=1}^n$ with $\mathbb{E}(X_i) = p_i$ and $\lambda = \mathbb{E}(X) = \sum_{i=1}^n p_i$, the two lower bounds on the relative entropy scale like $\left(\sum_{i=1}^n p_i^2\right)^2$ but with a different scaling factor.



D. Bounds on Related Quantities 

1) Bounds on the Hellinger Distance and Bhattacharyya Parameter: The following proposition introduces a sharpened version of Proposition 3.

Proposition 4: Let $P$ and $Q$ be two probability mass functions that are defined on the same set $\mathcal{X}$. Then, the following inequality suggests a sharpened version of the inequality in (66):

$$\sqrt{1 - \sqrt{1 - \big(d_{\mathrm{TV}}(P,Q)\big)^2}} \le d_{\mathrm{H}}(P,Q) \le \sqrt{1 - \exp\left(-\frac{D(P\|Q)}{2}\right)} \tag{90}$$

and

$$\exp\left(-\frac{D(P\|Q)}{2}\right) \le \mathrm{BC}(P,Q) \le \sqrt{1 - \big(d_{\mathrm{TV}}(P,Q)\big)^2}. \tag{91}$$

Fig. 2. This figure refers to the relative entropy between the binomial and Poisson distributions with the same mean $\lambda$. The horizontal axis refers to $\lambda$, and the vertical axis refers to a scaled relative entropy $n^2 D\big(\mathrm{Bin}(n, \frac{\lambda}{n}) \,\|\, \mathrm{Po}(\lambda)\big)$ (note that $\sum_{i=1}^n X_i \sim \mathrm{Bin}(n, \frac{\lambda}{n})$ when $X_i \sim \mathrm{Bern}(p_i)$ with $p_i = \frac{\lambda}{n}$ fixed for all $i \in \{1, \ldots, n\}$). This scaling of the relative entropy is supported by the upper bound on the relative entropy by Kontoyiannis et al. (see [33, Theorem 1]) that is equal to $\frac{1}{\lambda} \sum_{i=1}^n \frac{p_i^3}{1-p_i} = \frac{\lambda^2}{n^2} + O\left(\frac{1}{n^3}\right)$. It is also supported by the new lower bounds in Theorem 7 and Eq. (89) since the common term in these lower bounds is equal to $\left(\sum_{i=1}^n p_i^2\right)^2 = \frac{\lambda^4}{n^2}$, so a multiplication of these lower bounds on the relative entropy by $n^2$ gives an expression that only depends on $\lambda$. It follows from [19, Theorem 1] (see also [1, p. 2302]) that $D\big(\mathrm{Bin}(n, \frac{\lambda}{n}) \,\|\, \mathrm{Po}(\lambda)\big) = \frac{\lambda^2}{4n^2} + O\left(\frac{1}{n^3}\right)$ (so, the exact value is asymptotically equal to one-quarter of the upper bound). This figure shows the upper and lower bounds, as well as the exact asymptotic result, in order to study the tightness of the existing upper bound and the new lower bounds. By comparing the dotted and dashed lines, this figure also shows the significant impact of the refinement of the lower bound on the total variation distance by Barbour and Hall (see [4, Theorem 2]) on the improved lower bound on the relative entropy (the former improvement is squared via Pinsker's inequality or its refinement). Furthermore, by comparing the dotted and solid lines, this figure shows that the probability-dependent refinement of Pinsker's inequality, applied to the Poisson distribution, affects the lower bound for $\lambda < \log 2$.



Remark 21: A comparison of the upper and lower bounds on the Hellinger distance in (90), or the Bhattacharyya parameter in (91), gives the following lower bound on the relative entropy in terms of the total variation distance:

$$D(P\|Q) \ge \log\left(\frac{1}{1 - \big(d_{\mathrm{TV}}(P,Q)\big)^2}\right). \tag{92}$$

It is noted that (92) also follows from the combination of the last two inequalities in [25, p. 741]. It is tighter than Pinsker's inequality (see (68)) if $d_{\mathrm{TV}}(P,Q) \ge 0.893$, having also the advantage of giving the right bound for the relative entropy ($\infty$) when the total variation distance approaches 1. However, (92) is a slightly looser bound on the relative entropy in comparison to Vajda's lower bound [48] that reads:

$$D(P\|Q) \ge \log\left(\frac{1 + d_{\mathrm{TV}}(P,Q)}{1 - d_{\mathrm{TV}}(P,Q)}\right) - \frac{2\, d_{\mathrm{TV}}(P,Q)}{1 + d_{\mathrm{TV}}(P,Q)}. \tag{93}$$
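
The inequalities (90)-(93) are easy to check numerically for a given pair of distributions. The following Python lines are a small sketch for two probability mass functions on a finite set; the example vectors and the function names are arbitrary illustrative choices, and the Hellinger distance is taken in the normalization $d_{\mathrm{H}} = \sqrt{1 - \mathrm{BC}}$, which is assumed here because it is the one consistent with (90) and (91).

```python
import math


def divergences(P, Q):
    """Total variation, Bhattacharyya parameter, Hellinger distance, relative entropy (nats)."""
    d_tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))
    bc = sum(math.sqrt(p * q) for p, q in zip(P, Q))
    d_h = math.sqrt(max(0.0, 1.0 - bc))
    kl = sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0.0)
    return d_tv, bc, d_h, kl


def pinsker(d):
    return 2.0 * d * d


def bound_92(d):
    return math.log(1.0 / (1.0 - d * d))


def vajda(d):
    return math.log((1.0 + d) / (1.0 - d)) - 2.0 * d / (1.0 + d)


if __name__ == "__main__":
    P = [0.5, 0.3, 0.2]                       # arbitrary example
    Q = [0.2, 0.3, 0.5]
    d_tv, bc, d_h, kl = divergences(P, Q)
    print("d_TV, BC, d_H, D:", d_tv, bc, d_h, kl)
    print("(90):", math.sqrt(1 - math.sqrt(1 - d_tv ** 2)), "<=", d_h,
          "<=", math.sqrt(1 - math.exp(-kl / 2)))
    print("(91):", math.exp(-kl / 2), "<=", bc, "<=", math.sqrt(1 - d_tv ** 2))
    print("Pinsker, (92), (93):", pinsker(d_tv), bound_92(d_tv), vajda(d_tv))
```

For this example the total variation distance is small, so Pinsker's inequality is tighter than (92); the ordering reverses once the total variation distance is close to 1.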



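Corollary 3: Under the assumptions in Theorem 6, the Hellinger distance and Bhattacharyya parameter satisfy the following upper and lower bounds:

$$\sqrt{1 - \sqrt{1 - \big(K_1(\lambda)\big)^2 \left(\sum_{i=1}^n p_i^2\right)^2}} \le d_{\mathrm{H}}\big(P_W, \mathrm{Po}(\lambda)\big) \le \sqrt{1 - \exp\left(-\frac{1}{2\lambda} \sum_{i=1}^n \frac{p_i^3}{1-p_i}\right)} \tag{94}$$

and

$$\exp\left(-\frac{1}{2\lambda} \sum_{i=1}^n \frac{p_i^3}{1-p_i}\right) \le \mathrm{BC}\big(P_W, \mathrm{Po}(\lambda)\big) \le \sqrt{1 - \big(K_1(\lambda)\big)^2 \left(\sum_{i=1}^n p_i^2\right)^2} \tag{95}$$

where $K_1$ on the left-hand side of (94) and the right-hand side of (95) is introduced in (73).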

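Corollary 4: Let $\{S_n\}_{n=1}^{\infty}$ be a sequence of random variables where $S_n = \sum_{i=1}^n X_i^{(n)}$ is a sum of $n$ independent Bernoulli random variables $\{X_i^{(n)}\}_{i=1}^n$ with $\mathbb{P}(X_i^{(n)} = 1) = p_i^{(n)}$ (note that, for $n \ne m$, the binary random variables $X_i^{(n)}$ and $X_j^{(m)}$ may be dependent). Assume that $\mathbb{E}(S_n) = \sum_{i=1}^n p_i^{(n)} = \lambda$ for some $\lambda \in (0, \infty)$ and every $n \in \mathbb{N}$, and that there exist some fixed constants $c_1, c_2 > 0$ such that

$$\frac{c_1 \lambda}{n} \le p_i^{(n)} \le \frac{c_2 \lambda}{n}, \quad \forall\, i \in \{1, \ldots, n\}$$

(which implies that $c_1 \le 1$ and $c_2 \ge 1$, and $c_1 = c_2 = 1$ if and only if the binary random variables $\{X_i^{(n)}\}_{i=1}^n$ are i.i.d.). Then, the following asymptotic results hold:

$$D\big(P_{S_n} \,\|\, \mathrm{Po}(\lambda)\big) = O\left(\frac{1}{n^2}\right) \tag{96}$$

$$d_{\mathrm{TV}}\big(P_{S_n}, \mathrm{Po}(\lambda)\big) = O\left(\frac{1}{n}\right) \tag{97}$$

$$d_{\mathrm{H}}\big(P_{S_n}, \mathrm{Po}(\lambda)\big) = O\left(\frac{1}{n}\right) \tag{98}$$

$$\mathrm{BC}\big(P_{S_n}, \mathrm{Po}(\lambda)\big) = 1 - O\left(\frac{1}{n^2}\right) \tag{99}$$

so, the relative entropy between the distribution of $S_n$ and the Poisson distribution with mean $\lambda$ scales like $\frac{1}{n^2}$, the total variation and Hellinger distances scale like $\frac{1}{n}$, and the gap of the Bhattacharyya parameter to 1 scales like $\frac{1}{n^2}$.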
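
The $\frac{1}{n^2}$ scaling of the relative entropy can be observed numerically in the i.i.d. special case $p_i = \frac{\lambda}{n}$, where $S_n$ is binomial. The following Python lines are a sketch that sums the relative entropy exactly over the binomial support and prints the scaled value $n^2 D$ next to the scaled upper bound of (86); the function name and the chosen values of $n$ are illustrative.

```python
import math


def binom_poisson_divergence(n, lam):
    """Exact D(Bin(n, lam/n) || Po(lam)) in nats, summed over k = 0..n (requires n > lam)."""
    p = lam / n
    total = 0.0
    for k in range(n + 1):
        log_bin = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        log_poi = -lam + k * math.log(lam) - math.lgamma(k + 1)
        total += math.exp(log_bin) * (log_bin - log_poi)
    return total


if __name__ == "__main__":
    lam = 1.0
    for n in (10, 100, 1000):
        d = binom_poisson_divergence(n, lam)
        upper = lam ** 2 / (n * (n - lam))    # (1/lam) * sum p_i^3 / (1 - p_i) for p_i = lam/n
        print(n, n * n * d, n * n * upper)
```

As $n$ grows, the first printed column approaches $\frac{\lambda^2}{4}$ and the second approaches $\lambda^2$, in agreement with the one-quarter ratio mentioned in the caption of Figure 2.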

2) Bounds on the Chernoff Information: 

Proposition 5: Let $P$ and $Q$ be two probability mass functions that are defined on the same set $\mathcal{X}$. Then, the Chernoff information between $P$ and $Q$ is lower bounded in terms of the total variation distance as follows:

$$C(P,Q) \ge -\frac{1}{2} \log\Big(1 - \big(d_{\mathrm{TV}}(P,Q)\big)^2\Big). \tag{100}$$



Corollary 5: Under the assumptions in Theorem 6, the Chernoff information satisfies the following lower bound:

$$C\big(P_W, \mathrm{Po}(\lambda)\big) \ge -\frac{1}{2} \log\left(1 - \big(K_1(\lambda)\big)^2 \left(\sum_{i=1}^n p_i^2\right)^2\right) \tag{101}$$

where $K_1$ is introduced in (73).

Remark 22: Remark 17 also applies to Corollaries 3 and 5.

Remark 23: The combination of Proposition 5 with the lower bound on the total variation distance in [4, Theorem 2] (see Theorem 1 here) gives the following looser lower bound on the Chernoff information:

$$C\big(P_W, \mathrm{Po}(\lambda)\big) \ge -\frac{1}{2} \log\left(1 - \frac{1}{1024} \left(1 \wedge \frac{1}{\lambda^2}\right) \left(\sum_{i=1}^n p_i^2\right)^2\right). \tag{102}$$

The impact of the tightened lower bound in (101), as compared to the bound in (102), is exemplified in Section III-E in the context of the Bayesian approach for binary hypothesis testing.

E. Applications of the New Bounds in Section III

In the following, we consider the use of the new bounds in Section III for binary hypothesis testing.

Example 4 (Application of the Chernoff-Stein lemma and lower bounds on the relative entropy): The Chernoff-Stein lemma considers the asymptotic error exponent in binary hypothesis testing when one of the probabilities of error is held fixed, and the other one has to be made as small as possible (see, e.g., [11, Theorem 11.8.3]).

Let $\{Y_j\}_{j=1}^N$ be a sequence of non-negative, integer-valued i.i.d. random variables with $\mathbb{E}(Y_j) = \lambda$ for some $\lambda \in (0, \infty)$. Let $Y_1 \sim Q$ where we consider the following two hypotheses:

• $H_1$: $Q = P_1$ where $Y_j$, for $j \in \{1, \ldots, N\}$, is a sum of $n$ binary random variables $\{X_{i,j}\}_{i=1}^n$ with $\mathbb{E}(X_{i,j}) = p_i$ and $\sum_{i=1}^n p_i = \lambda$. It is assumed that the elements of the sequence $\{X_{i,j}\}$ are independent, and $n \in \mathbb{N}$ is fixed.

• $H_2$: $Q = P_2$ is the Poisson distribution with mean $\lambda$ (i.e., $Y_1 \sim \mathrm{Po}(\lambda)$).

Note that in this case, if one of the $Y_j$ exceeds the value $n$ then $H_1$ is rejected automatically, so one may assume that $n \gg \max\{\lambda, 1\}$. More explicitly, if $Y_j \sim \mathrm{Po}(\lambda)$ for $j \in \{1, \ldots, N\}$, the probability of this event is upper bounded (via the union and Chernoff bounds) by

$$\mathbb{P}\big(\exists\, j \in \{1, \ldots, N\} : Y_j \ge n+1\big) \le N \exp\left\{-\left[\lambda + (n+1) \log\left(\frac{n+1}{\lambda e}\right)\right]\right\} \tag{103}$$

so, if $n \ge 10 \max\{\lambda, 1\}$, this probability is typically very small.

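For an arbitrary $N \in \mathbb{N}$, let $\mathcal{A}_N$ be an acceptance region for hypothesis 1. Using standard notation, let

$$\alpha_N \triangleq P_1^N(\mathcal{A}_N^c), \quad \beta_N \triangleq P_2^N(\mathcal{A}_N) \tag{104}$$

be the two types of error probabilities. Following [11, Theorem 11.8.3], for an arbitrary $\varepsilon \in (0, \frac{1}{2})$, let

$$\beta_N^{\varepsilon} \triangleq \min_{\mathcal{A}_N \subseteq \mathcal{Y}^N :\; \alpha_N < \varepsilon} \beta_N$$

where $\mathcal{Y} = \{0, 1, \ldots, n\}$ is the alphabet that is associated with hypothesis $H_1$. Then, the best asymptotic exponent of $\beta_N^{\varepsilon}$ in the limit where $\varepsilon \to 0$ is

$$\lim_{\varepsilon \to 0} \lim_{N \to \infty} \frac{1}{N} \log \beta_N^{\varepsilon} = -D(P_1 \| P_2).$$

From [11, Eqs. (11.206), (11.207) and (11.227)], for the relative entropy typical set that is defined by

$$\mathcal{A}_{\varepsilon}^{(N)}(P_1 \| P_2) \triangleq \left\{ \underline{y} \in \mathcal{Y}^N : \left| \frac{1}{N} \log \frac{P_1^N(\underline{y})}{P_2^N(\underline{y})} - D(P_1 \| P_2) \right| \le \varepsilon \right\} \tag{105}$$

it follows from the AEP for relative entropy that $\alpha_N < \varepsilon$ for $N$ large enough (see, e.g., [11, Theorem 11.8.1]). Furthermore, for every $N$ (see, e.g., [11, Theorem 11.8.2]),

$$\beta_N < \exp\Big(-N \big(D(P_1 \| P_2) - \varepsilon\big)\Big). \tag{106}$$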

The error probability of the second type $\beta_N$ is treated here separately from $\alpha_N$. In this case, a lower bound on the relative entropy $D(P_1 \| P_2)$ gives an exponential upper bound on $\beta_N$. Let $\varepsilon \to 0$ (more explicitly, let $\varepsilon$ be chosen to be small enough as compared to a lower bound on $D(P_1 \| P_2)$). In the following two simple examples, we calculate the improved lower bound in Theorem 7, and compare it to the lower bound in (89). More importantly, we study the impact of Theorem 7 on the reduction of the number of samples $N$ that are required for achieving $\beta_N < \varepsilon$. The following two cases are used to exemplify this issue:

1) Let the probabilities $\{p_i\}_{i=1}^n$ (that correspond to hypothesis 1) be given by

$$p_i = \frac{2ai}{n}, \quad \forall\, i \in \{1, \ldots, n\}.$$

For $\lambda \in (0, \infty)$, in order to satisfy the equality $\sum_{i=1}^n p_i = \lambda$ then $a = \frac{\lambda}{n+1}$, and $\sum_{i=1}^n p_i^2 = \frac{2\lambda^2 (2n+1)}{3n(n+1)}$. From Theorem 7, the improved lower bound on the relative entropy reads

$$D(P_1 \| P_2) \ge K_2(\lambda) \left(\frac{2\lambda^2 (2n+1)}{3n(n+1)}\right)^2 \tag{107}$$

where $K_2$ is introduced in (87), and the weaker lower bound in (89) gets the form

$$D(P_1 \| P_2) \ge \frac{\lambda^4}{1152} \left(1 \wedge \frac{1}{\lambda^2}\right) \left(\frac{2n+1}{n(n+1)}\right)^2. \tag{108}$$

Let us examine the two bounds on the relative entropy for $\lambda = 10$ and $n = 100$ to find accordingly a proper value of $N$ such that $\beta_N < 10^{-10}$, and choose $\varepsilon = 10^{-10}$. Note that the probability of the event that one of the $N$ Poisson random variables $\{Y_j\}_{j=1}^N$, under hypothesis $H_2$, exceeds the value $n$ is upper bounded in (103) by $1.22 N \cdot 10^{-62}$, so it is neglected for all reasonable amounts of samples $N$. In this setting, the two lower bounds on the relative entropy in (107) and (108), respectively, are equal to $2.47 \cdot 10^{-4}$ and $3.44 \cdot 10^{-5}$ nats. For these two lower bounds, the exponential upper bound in (106) ensures that $\beta_N < 10^{-10}$ for $N \ge 9.32 \cdot 10^4$ and $N \ge 6.70 \cdot 10^5$, respectively. Hence, the improved lower bound on the relative entropy in Theorem 7 implies here a reduction in the required number of samples by a factor of 7.17.
2) In the second case, assume that the probabilities $\{p_i\}_{i=1}^n$ scale exponentially in $i$ (instead of the linear scaling in the previous case). Let $\alpha \in (0, 1)$ and consider the case where

$$p_i = p_1 \alpha^{i-1}, \quad \forall\, i \in \{1, \ldots, n\}.$$

For $\lambda \in (0, \infty)$, in order to satisfy the equality $\sum_{i=1}^n p_i = \lambda$ then $p_1 = \frac{\lambda (1-\alpha)}{1-\alpha^n}$, and $\sum_{i=1}^n p_i^2 = \lambda^2 \, \frac{(1-\alpha)(1+\alpha^n)}{(1+\alpha)(1-\alpha^n)}$. Hence, the improved lower bound in Theorem 7 and the other bound in (89) imply respectively that

$$D(P_1 \| P_2) \ge \lambda^4 K_2(\lambda) \left(\frac{1-\alpha}{1+\alpha} \cdot \frac{1+\alpha^n}{1-\alpha^n}\right)^2 \tag{109}$$

and

$$D(P_1 \| P_2) \ge \frac{\lambda^4}{512} \left(1 \wedge \frac{1}{\lambda^2}\right) \left(\frac{1-\alpha}{1+\alpha} \cdot \frac{1+\alpha^n}{1-\alpha^n}\right)^2. \tag{110}$$

The choice $\alpha = 0.05$, $\lambda = 0.1$ and $n = 100$ implies that the two lower bounds on the relative entropy in (109) and (110) are respectively equal to $2.48 \cdot 10^{-5}$ and $1.60 \cdot 10^{-7}$. The exponential upper bound in (106) therefore ensures that $\beta_N < 10^{-10}$ for $N \ge 9.26 \cdot 10^5$ and $N \ge 1.44 \cdot 10^8$, respectively. Hence, the improvement in Theorem 7 leads in this case to the conclusion that one can achieve the target error probability of the second type while reducing the number of samples $\{Y_j\}_{j=1}^N$ by a factor of 155.
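
The quantities of Example 4 that do not depend on $K_1$ can be reproduced with a few lines of Python. This is a sketch with illustrative function names: it evaluates $\sum p_i^2$ for the two cases, the looser bound (89), the right-hand side of (103), and the number of samples implied by (106). The improved bounds $2.47 \cdot 10^{-4}$ and $2.48 \cdot 10^{-5}$ are taken from the text rather than recomputed, since they require the optimization in (73).

```python
import math


def sum_sq_linear(lam, n):
    """Sum of p_i^2 for p_i proportional to i with sum p_i = lam (case 1)."""
    return sum((2.0 * lam * i / (n * (n + 1))) ** 2 for i in range(1, n + 1))


def sum_sq_geometric(lam, n, alpha):
    """Sum of p_i^2 for p_i = p_1 * alpha^(i-1) with sum p_i = lam (case 2)."""
    p1 = lam * (1.0 - alpha) / (1.0 - alpha ** n)
    return sum((p1 * alpha ** (i - 1)) ** 2 for i in range(1, n + 1))


def loose_divergence_bound(lam, sum_sq):
    """Right-hand side of (89)."""
    return min(1.0, 1.0 / lam ** 2) * sum_sq ** 2 / 512.0


def samples_needed(d_lower, target=1e-10, eps=1e-10):
    """Smallest N with exp(-N (d_lower - eps)) <= target, following (106)."""
    return math.ceil(math.log(1.0 / target) / (d_lower - eps))


def poisson_tail_union_bound(lam, n, N=1):
    """Right-hand side of (103)."""
    m = n + 1
    return N * math.exp(-(lam + m * math.log(m / (lam * math.e))))


if __name__ == "__main__":
    b1 = loose_divergence_bound(10.0, sum_sq_linear(10.0, 100))
    b2 = loose_divergence_bound(0.1, sum_sq_geometric(0.1, 100, 0.05))
    print("case 1:", b1, samples_needed(b1), samples_needed(2.47e-4))
    print("case 2:", b2, samples_needed(b2), samples_needed(2.48e-5))
    print("tail bound per sample:", poisson_tail_union_bound(10.0, 100))
```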

Example 5 (Application of the lower bounds on the Chernoff information to binary hypothesis testing): We turn to consider binary hypothesis testing with the Bayesian approach (see, e.g., [11, Section 11.9]). In this setting, one wishes to minimize the overall probability of error, and we refer to the two hypotheses in Example 4. The best asymptotic exponent in the Bayesian approach is the Chernoff information (see (64)), and the overall error probability satisfies the following exponential upper bound:

$$P_e^{(N)} \le \exp\big(-N\, C(P_1, P_2)\big) \tag{111}$$

so, a lower bound on the Chernoff information provides an upper bound on the overall error probability. In the following, the two lower bounds on the Chernoff information in (101) and (102) are compared, and the advantage of the former lower bound is studied in the two cases of Example 4 in order to examine the impact of its improved tightness on the reduction of the number of samples $N$ that are required to achieve an overall error probability below $\varepsilon = 10^{-10}$. We refer, respectively, to cases 1 and 2 of Example 4.

1) In case 1 of Example 4, the two lower bounds on the Chernoff information in Corollary 5 and Remark 23 (following the calculation of $\sum_{i=1}^n p_i^2$ for these two cases) are

$$C(P_1, P_2) \ge \begin{cases} -\dfrac{1}{2} \log\left(1 - \big(K_1(\lambda)\big)^2 \left(\dfrac{2\lambda^2 (2n+1)}{3n(n+1)}\right)^2\right) & \text{from Eq. (101) (Corollary 5)} \\[3mm] -\dfrac{1}{2} \log\left(1 - \dfrac{\lambda^4}{2304} \min\left\{1, \dfrac{1}{\lambda^2}\right\} \left(\dfrac{2n+1}{n(n+1)}\right)^2\right) & \text{from Eq. (102) (Remark 23).} \end{cases}$$

As in the first case of Example 4, let $\lambda = 10$ and $n = 100$. The lower bounds on the Chernoff information are therefore equal to

$$C(P_1, P_2) \ge \begin{cases} 6.16 \cdot 10^{-5} & \text{from Eq. (101)} \\ 8.59 \cdot 10^{-6} & \text{from Eq. (102).} \end{cases} \tag{112}$$

Hence, in order to achieve the target $P_e^{(N)} \le 10^{-10}$ for the overall error probability, the lower bounds on the Chernoff information in (112) and the exponential upper bound on the overall error probability in (111) imply that

$$N \ge \begin{cases} 3.74 \cdot 10^{5} & \text{from Eqs. (101) and (111)} \\ 2.68 \cdot 10^{6} & \text{from Eqs. (102) and (111)} \end{cases} \tag{113}$$

so, the number of required samples is approximately reduced by a factor of 7.
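2) For the second case in Example 4, the lower bounds on the Chernoff information in Eqs. (101) and (102) read

$$C(P_1, P_2) \ge \begin{cases} -\dfrac{1}{2} \log\left(1 - \lambda^4 \big(K_1(\lambda)\big)^2 \left(\dfrac{1-\alpha}{1+\alpha} \cdot \dfrac{1+\alpha^n}{1-\alpha^n}\right)^2\right) & \text{from Eq. (101)} \\[3mm] -\dfrac{1}{2} \log\left(1 - \dfrac{\lambda^4}{1024} \min\left\{1, \dfrac{1}{\lambda^2}\right\} \left(\dfrac{1-\alpha}{1+\alpha} \cdot \dfrac{1+\alpha^n}{1-\alpha^n}\right)^2\right) & \text{from Eq. (102)} \end{cases}$$

so, the same choice of parameters $\alpha = 0.05$, $\lambda = 0.1$ and $n = 100$ as in Example 4 implies that

$$C(P_1, P_2) \ge \begin{cases} 4.93 \cdot 10^{-6} & \text{from Eq. (101)} \\ 4.00 \cdot 10^{-8} & \text{from Eq. (102).} \end{cases} \tag{114}$$

For obtaining the target $P_e^{(N)} \le 10^{-10}$ for the overall error probability, the lower bounds on the Chernoff information in (114) and the exponential upper bound on the overall error probability in (111) imply that

$$N \ge \begin{cases} 4.68 \cdot 10^{6} & \text{from Eqs. (101) and (111)} \\ 5.76 \cdot 10^{8} & \text{from Eqs. (102) and (111)} \end{cases} \tag{115}$$

so, the improved lower bound on the Chernoff information implies in this case a reduction in the required number of samples $N$ by a factor of 123.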
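
The looser figures of Example 5 follow directly from (100), (102) and (111), and can be reproduced with the following Python sketch. The function names are illustrative, and the lower bound on the total variation distance is written as $\frac{1}{32} \min\{1, \frac{1}{\lambda}\} \sum p_i^2$, which is the form implied by (102). The tighter figures depend on $K_1$ from (73) and are not recomputed here.

```python
import math


def chernoff_lower_bound(d_tv_lower):
    """(100) applied to a lower bound on d_TV (the bound is increasing in d_TV)."""
    return -0.5 * math.log1p(-d_tv_lower ** 2)


def barbour_hall_tv_lower_bound(lam, sum_sq):
    """(1/32) min{1, 1/lam} sum p_i^2, the lower bound behind (102)."""
    return min(1.0, 1.0 / lam) * sum_sq / 32.0


def samples_for_bayes_error(c_lower, target=1e-10):
    """Smallest N with exp(-N c_lower) <= target, following (111)."""
    return math.ceil(math.log(1.0 / target) / c_lower)


if __name__ == "__main__":
    lam, n = 10.0, 100                        # case 1 of Example 4
    s1 = 2.0 * lam ** 2 * (2 * n + 1) / (3.0 * n * (n + 1))
    c1 = chernoff_lower_bound(barbour_hall_tv_lower_bound(lam, s1))
    print("case 1:", c1, samples_for_bayes_error(c1))

    lam, n, alpha = 0.1, 100, 0.05            # case 2 of Example 4
    s2 = lam ** 2 * (1 - alpha) * (1 + alpha ** n) / ((1 + alpha) * (1 - alpha ** n))
    c2 = chernoff_lower_bound(barbour_hall_tv_lower_bound(lam, s2))
    print("case 2:", c2, samples_for_bayes_error(c2))
```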



F. Proofs of the New Results in Section III

1) Proof of Theorem 6: The proof of Theorem 6 starts similarly to the proof of [4, Theorem 2]. However, it significantly deviates from the original analysis in order to derive an improved lower bound on the total variation distance. In the following, we introduce the proof of Theorem 6.

Let $\{X_i\}_{i=1}^n$ be independent Bernoulli random variables with $\mathbb{E}(X_i) = p_i$. Let $W \triangleq \sum_{i=1}^n X_i$, $V_i \triangleq \sum_{j \ne i} X_j$ for every $i \in \{1, \ldots, n\}$, and $Z \sim \mathrm{Po}(\lambda)$ with mean $\lambda \triangleq \sum_{i=1}^n p_i$. From the basic equation of the Chen-Stein method, the equality

$$\mathbb{E}\big[\lambda f(Z+1) - Z f(Z)\big] = 0 \tag{116}$$

holds for an arbitrary bounded function $f : \mathbb{N}_0 \to \mathbb{R}$. Furthermore

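$$\begin{aligned}
& \mathbb{E}\big[\lambda f(W+1) - W f(W)\big] \\
&= \sum_{j=1}^n p_j \, \mathbb{E}\big[f(W+1)\big] - \sum_{j=1}^n \mathbb{E}\big[X_j f(W)\big] \\
&= \sum_{j=1}^n p_j \, \mathbb{E}\big[f(W+1)\big] - \sum_{j=1}^n p_j \, \mathbb{E}\big[f(V_j+1) \,\big|\, X_j = 1\big] \\
&\overset{(a)}{=} \sum_{j=1}^n p_j \, \mathbb{E}\big[f(W+1) - f(V_j+1)\big] \\
&= \sum_{j=1}^n p_j^2 \, \mathbb{E}\big[f(W+1) - f(V_j+1) \,\big|\, X_j = 1\big] \\
&= \sum_{j=1}^n p_j^2 \, \mathbb{E}\big[f(V_j+2) - f(V_j+1) \,\big|\, X_j = 1\big] \\
&\overset{(b)}{=} \sum_{j=1}^n p_j^2 \, \mathbb{E}\big[f(V_j+2) - f(V_j+1)\big]
\end{aligned} \tag{117}$$

where equalities (a) and (b) hold since $X_j$ and $V_j$ are independent random variables for every $j \in \{1, \ldots, n\}$. By subtracting (116) from (117), it follows that for an arbitrary bounded function $f : \mathbb{N}_0 \to \mathbb{R}$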



$$\mathbb{E}\big[\lambda f(W+1) - W f(W)\big] - \mathbb{E}\big[\lambda f(Z+1) - Z f(Z)\big] = \sum_{j=1}^n p_j^2 \, \mathbb{E}\big[f(V_j+2) - f(V_j+1)\big]. \tag{118}$$

In the following, an upper bound on the left-hand side of (118) is derived, based on the total variation distance between the two distributions of $W$ and $Z$.



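$$\begin{aligned}
& \mathbb{E}\big[\lambda f(W+1) - W f(W)\big] - \mathbb{E}\big[\lambda f(Z+1) - Z f(Z)\big] \\
&= \sum_{k=0}^{\infty} \big(\lambda f(k+1) - k f(k)\big) \big(\mathbb{P}(W = k) - \mathbb{P}(Z = k)\big) \\
&\le \sum_{k=0}^{\infty} \big|\lambda f(k+1) - k f(k)\big| \, \big|\mathbb{P}(W = k) - \mathbb{P}(Z = k)\big|
\end{aligned} \tag{119}$$

$$\begin{aligned}
&\le \sup_{k \in \mathbb{N}_0} \big|\lambda f(k+1) - k f(k)\big| \, \sum_{k=0}^{\infty} \big|\mathbb{P}(W = k) - \mathbb{P}(Z = k)\big| \\
&= 2\, d_{\mathrm{TV}}\big(P_W, \mathrm{Po}(\lambda)\big) \, \sup_{k \in \mathbb{N}_0} \big|\lambda f(k+1) - k f(k)\big|
\end{aligned} \tag{120}$$

where the last equality follows from the definition of the total variation distance. Hence, the combination of (118) and (120) gives the following lower bound on the total variation distance:

$$d_{\mathrm{TV}}\big(P_W, \mathrm{Po}(\lambda)\big) \ge \frac{\sum_{j=1}^n \Big\{ p_j^2 \, \mathbb{E}\big[f(V_j+2) - f(V_j+1)\big] \Big\}}{2 \sup_{k \in \mathbb{N}_0} \big|\lambda f(k+1) - k f(k)\big|} \tag{121}$$

which holds, in general, for an arbitrary bounded function $f : \mathbb{N}_0 \to \mathbb{R}$.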
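
The identity (118) and the lower bound (121) can be verified numerically for a small example by computing all the distributions exactly. The following Python lines are a sketch under stated assumptions: the probabilities $p_i$, the test function $f$ (of the exponential form that is used later in the proof, with arbitrary parameter values), and all function names are illustrative; the Poisson distribution is truncated at `kmax` terms, and the supremum in (121) is replaced by a maximum over $k < $ `kmax`, which is adequate here since the chosen $f$ decays rapidly.

```python
import math


def pmf_of_sum(ps):
    """Exact pmf of a sum of independent Bernoulli(p) variables, by convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf


def check_chen_stein(ps, f, kmax=200):
    lam = sum(ps)
    pw = pmf_of_sum(ps)
    # Left-hand side of (118); the Poisson term vanishes by (116).
    lhs = sum(m * (lam * f(k + 1) - k * f(k)) for k, m in enumerate(pw))
    # Right-hand side of (118), using the exact pmf of each V_j.
    rhs = 0.0
    for j, pj in enumerate(ps):
        pv = pmf_of_sum(ps[:j] + ps[j + 1:])
        rhs += pj ** 2 * sum(m * (f(k + 2) - f(k + 1)) for k, m in enumerate(pv))
    # Total variation distance to Po(lam), truncated at kmax terms.
    po = [math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1)) for k in range(kmax)]
    pw_ext = pw + [0.0] * (kmax - len(pw))
    d_tv = 0.5 * sum(abs(a - b) for a, b in zip(pw_ext, po))
    # Lower bound (121), with the supremum approximated by a maximum over k < kmax.
    sup = max(abs(lam * f(k + 1) - k * f(k)) for k in range(kmax))
    return lhs, rhs, d_tv, rhs / (2.0 * sup)


ps = [0.1] * 10                               # arbitrary example
lam = sum(ps)
theta = 8.0                                   # arbitrary positive parameter
f = lambda k: (k - lam) * math.exp(-(k - lam) ** 2 / (theta * lam))
lhs, rhs, d_tv, bound = check_chen_stein(ps, f)

if __name__ == "__main__":
    print("both sides of (118):", lhs, rhs)
    print("lower bound (121) and exact d_TV:", bound, d_tv)
```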

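At this point, we deviate from the proof of [4, Theorem 2] by generalizing and refining (in a non-trivial way) the original analysis. The general problem with the current lower bound in (121) is that it is not calculable in closed form for a given $f$, so one needs to choose a proper function $f$ and derive a closed-form expression for a lower bound on the right-hand side of (121). To this end, let

$$f(k) \triangleq (k - \alpha_1) \exp\left(-\frac{(k - \alpha_2)^2}{\theta\lambda}\right), \quad \forall\, k \in \mathbb{N}_0 \tag{122}$$

where $\alpha_1, \alpha_2 \in \mathbb{R}$ and $\theta \in \mathbb{R}^+$ are fixed constants (note that $\theta$ in (122) needs to be positive for $f$ to be a bounded function). In order to derive a lower bound on the total variation distance, we calculate a lower bound on the numerator and an upper bound on the denominator of the right-hand side of (121) for the function $f$ in (122).

Referring to the numerator of the right-hand side of (121) with $f$ in (122), for every $j \in \{1, \ldots, n\}$,

$$\begin{aligned}
& f(V_j+2) - f(V_j+1) \\
&= \int_{V_j+1-\alpha_2}^{V_j+2-\alpha_2} \frac{d}{du} \left[ (u + \alpha_2 - \alpha_1) \exp\left(-\frac{u^2}{\theta\lambda}\right) \right] du \\
&= \int_{V_j+1-\alpha_2}^{V_j+2-\alpha_2} \left(1 - \frac{2u(u + \alpha_2 - \alpha_1)}{\theta\lambda}\right) \exp\left(-\frac{u^2}{\theta\lambda}\right) du \\
&= \int_{V_j+1-\alpha_2}^{V_j+2-\alpha_2} \left(1 - \frac{2u^2}{\theta\lambda}\right) \exp\left(-\frac{u^2}{\theta\lambda}\right) du - \frac{2(\alpha_2 - \alpha_1)}{\theta\lambda} \int_{V_j+1-\alpha_2}^{V_j+2-\alpha_2} u \exp\left(-\frac{u^2}{\theta\lambda}\right) du \\
&= \int_{V_j+1-\alpha_2}^{V_j+2-\alpha_2} \left(1 - \frac{2u^2}{\theta\lambda}\right) \exp\left(-\frac{u^2}{\theta\lambda}\right) du + (\alpha_2 - \alpha_1) \left[ \exp\left(-\frac{(V_j+2-\alpha_2)^2}{\theta\lambda}\right) - \exp\left(-\frac{(V_j+1-\alpha_2)^2}{\theta\lambda}\right) \right].
\end{aligned} \tag{123}$$

We rely in the following on the inequality

$$(1 - 2x)\, e^{-x} \ge 1 - 3x, \quad \forall\, x \ge 0.$$

Applying it to the integral on the right-hand side of (123) gives that

$$\begin{aligned}
& f(V_j+2) - f(V_j+1) \\
&\ge \int_{V_j+1-\alpha_2}^{V_j+2-\alpha_2} \left(1 - \frac{3u^2}{\theta\lambda}\right) du + (\alpha_2 - \alpha_1) \left[ \exp\left(-\frac{(V_j+2-\alpha_2)^2}{\theta\lambda}\right) - \exp\left(-\frac{(V_j+1-\alpha_2)^2}{\theta\lambda}\right) \right] \\
&\ge 1 - \frac{(V_j+2-\alpha_2)^3 - (V_j+1-\alpha_2)^3}{\theta\lambda} - |\alpha_2 - \alpha_1| \cdot \left| \exp\left(-\frac{(V_j+2-\alpha_2)^2}{\theta\lambda}\right) - \exp\left(-\frac{(V_j+1-\alpha_2)^2}{\theta\lambda}\right) \right|.
\end{aligned} \tag{124}$$

In order to proceed, note that if $x_1, x_2 \ge 0$ then (based on the mean-value theorem of calculus)

$$\begin{aligned}
\big| e^{-x_2} - e^{-x_1} \big| &= \big| e^{-c} (x_1 - x_2) \big| \quad \text{for some } c \in \big[\min\{x_1, x_2\}, \max\{x_1, x_2\}\big] \\
&\le e^{-\min\{x_1, x_2\}} \, |x_1 - x_2|
\end{aligned}$$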

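which, by applying it to the second term on the right-hand side of (124), gives that for every $j \in \{1, \ldots, n\}$

$$\begin{aligned}
& \left| \exp\left(-\frac{(V_j+2-\alpha_2)^2}{\theta\lambda}\right) - \exp\left(-\frac{(V_j+1-\alpha_2)^2}{\theta\lambda}\right) \right| \\
&\le \exp\left(-\frac{\min\big\{(V_j+2-\alpha_2)^2, \, (V_j+1-\alpha_2)^2\big\}}{\theta\lambda}\right) \cdot \frac{\big| (V_j+2-\alpha_2)^2 - (V_j+1-\alpha_2)^2 \big|}{\theta\lambda}.
\end{aligned} \tag{125}$$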



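Since $V_j = \sum_{i \ne j} X_i \ge 0$ then

$$\min\big\{(V_j+2-\alpha_2)^2, \, (V_j+1-\alpha_2)^2\big\} \ge \begin{cases} 0 & \text{if } \alpha_2 \ge 1 \\ (1-\alpha_2)^2 & \text{if } \alpha_2 < 1 \end{cases} \;=\; (1-\alpha_2)_+^2 \tag{126}$$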



I. SASON: AN INFORMATION-THEORETIC PERSPECTIVE OF THE POISSON APPROXIMATION VIA THE CHEN-STEIN METHOD 






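where

$$x_+ \triangleq \max\{x, 0\}, \quad x_+^2 \triangleq (x_+)^2, \quad \forall\, x \in \mathbb{R}.$$

Hence, the combination of the two inequalities in (125)-(126) gives that

$$\begin{aligned}
& \left| \exp\left(-\frac{(V_j+2-\alpha_2)^2}{\theta\lambda}\right) - \exp\left(-\frac{(V_j+1-\alpha_2)^2}{\theta\lambda}\right) \right| \\
&\le \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right) \cdot \frac{\big| (V_j+2-\alpha_2)^2 - (V_j+1-\alpha_2)^2 \big|}{\theta\lambda} \\
&= \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right) \cdot \frac{|2V_j + 3 - 2\alpha_2|}{\theta\lambda} \\
&\le \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right) \cdot \frac{2V_j + |3 - 2\alpha_2|}{\theta\lambda}
\end{aligned} \tag{127}$$

and therefore, a combination of the inequalities in (124) and (127) gives that

$$f(V_j+2) - f(V_j+1) \ge 1 - \frac{(V_j+2-\alpha_2)^3 - (V_j+1-\alpha_2)^3}{\theta\lambda} - |\alpha_2 - \alpha_1| \cdot \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right) \cdot \frac{2V_j + |3 - 2\alpha_2|}{\theta\lambda}. \tag{128}$$

Let $U_j \triangleq V_j - \lambda$, then

$$\begin{aligned}
& f(V_j+2) - f(V_j+1) \\
&\ge 1 - \frac{(U_j + \lambda + 2 - \alpha_2)^3 - (U_j + \lambda + 1 - \alpha_2)^3}{\theta\lambda} - |\alpha_2 - \alpha_1| \cdot \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right) \cdot \frac{2U_j + 2\lambda + |3 - 2\alpha_2|}{\theta\lambda} \\
&= 1 - \frac{3U_j^2 + 3(3 - 2\alpha_2 + 2\lambda) U_j + (2 - \alpha_2 + \lambda)^3 - (1 - \alpha_2 + \lambda)^3}{\theta\lambda} - |\alpha_2 - \alpha_1| \cdot \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right) \cdot \frac{2U_j + 2\lambda + |3 - 2\alpha_2|}{\theta\lambda}.
\end{aligned} \tag{129}$$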



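In order to derive a lower bound on the numerator of the right-hand side of (121), for the function $f$ in (122), we need to calculate the expected value of the right-hand side of (129). To this end, the first and second moments of $U_j$ are calculated as follows:

$$\mathbb{E}(U_j) = \mathbb{E}(V_j) - \lambda = \sum_{i \ne j} p_i - \sum_{i=1}^n p_i = -p_j \tag{130}$$

and

$$\mathbb{E}(U_j^2) = \mathbb{E}\big[(V_j - \lambda)^2\big] = \mathbb{E}\left[ \left( \sum_{i \ne j} (X_i - p_i) - p_j \right)^2 \right]$$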






SUBMITTED TO THE IEEE TRANSACTIONS ON INFORMATION THEORY IN JUNE 24, 2012. LAST UPDATED: SEPTEMBER 4, 2012. 



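$${} \overset{(a)}{=} \sum_{i \ne j} \mathbb{E}\big[(X_i - p_i)^2\big] + p_j^2 \overset{(b)}{=} \sum_{i \ne j} p_i - \sum_{i \ne j} p_i^2 + p_j^2 = \lambda - p_j - \sum_{i \ne j} p_i^2 + p_j^2 \tag{131}$$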



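where equalities (a) and (b) hold since, by assumption, the binary random variables $\{X_i\}_{i=1}^n$ are independent and $\mathbb{E}(X_i) = p_i$, $\mathrm{Var}(X_i) = p_i (1 - p_i)$. By taking expectations on both sides of (129), one obtains from (130) and (131) that

$$\begin{aligned}
& \mathbb{E}\big[f(V_j+2) - f(V_j+1)\big] \\
&\ge 1 - \frac{3\big(\lambda - p_j - \sum_{i \ne j} p_i^2 + p_j^2\big) + 3(3 - 2\alpha_2 + 2\lambda)(-p_j) + (2 - \alpha_2 + \lambda)^3 - (1 - \alpha_2 + \lambda)^3}{\theta\lambda} \\
&\quad - |\alpha_2 - \alpha_1| \cdot \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right) \cdot \frac{-2p_j + 2\lambda + |3 - 2\alpha_2|}{\theta\lambda} \\
&= 1 - \frac{3\lambda + (2 - \alpha_2 + \lambda)^3 - (1 - \alpha_2 + \lambda)^3 - \big[ 3p_j(1 - p_j) + 3\sum_{i \ne j} p_i^2 + 3(3 - 2\alpha_2 + 2\lambda) p_j \big]}{\theta\lambda} \\
&\quad - \frac{|\alpha_2 - \alpha_1| \, \big(2\lambda - 2p_j + |3 - 2\alpha_2|\big)}{\theta\lambda} \cdot \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right) \\
&\ge 1 - \frac{3\lambda + (2 - \alpha_2 + \lambda)^3 - (1 - \alpha_2 + \lambda)^3 - (9 - 6\alpha_2 + 6\lambda) p_j}{\theta\lambda} - \frac{|\alpha_2 - \alpha_1| \, \big(2\lambda + |3 - 2\alpha_2|\big)}{\theta\lambda} \cdot \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right).
\end{aligned} \tag{132}$$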



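Therefore, from (132), the following lower bound on the numerator of the right-hand side of (121) holds:

$$\begin{aligned}
\sum_{j=1}^n \Big\{ p_j^2 \, \mathbb{E}\big[f(V_j+2) - f(V_j+1)\big] \Big\} \ge{} & \left( \frac{3(3 - 2\alpha_2 + 2\lambda)}{\theta\lambda} \right) \sum_{j=1}^n p_j^3 \\
& + \left( 1 - \frac{3\lambda + (2 - \alpha_2 + \lambda)^3 - (1 - \alpha_2 + \lambda)^3 + |\alpha_1 - \alpha_2| \, \big(2\lambda + |3 - 2\alpha_2|\big) \exp\left(-\frac{(1-\alpha_2)_+^2}{\theta\lambda}\right)}{\theta\lambda} \right) \sum_{j=1}^n p_j^2.
\end{aligned} \tag{133}$$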



Note that if $\alpha_2 \le \lambda + \frac{3}{2}$, which is a condition that is involved in the maximization of (73), then the first term on the right-hand side of (133) is non-negative and can be removed, and the resulting lower bound on the numerator of the right-hand side of (121) gets the form

$$\sum_{j=1}^n \Big\{ p_j^2 \, \mathbb{E}\big[f(V_j+2) - f(V_j+1)\big] \Big\} \ge \big(1 - h_\lambda(\alpha_1, \alpha_2, \theta)\big) \sum_{j=1}^n p_j^2 \tag{134}$$

where the function $h_\lambda$ is introduced in (74).

We turn now to derive an upper bound on the denominator of the right-hand side of (121). Therefore, we need to derive a closed-form upper bound on $\sup_{k \in \mathbb{N}_0} \big|\lambda f(k+1) - k f(k)\big|$ with the function $f$ in (122). For every $k \in \mathbb{N}_0$

$$\lambda f(k+1) - k f(k) = \lambda \big[f(k+1) - f(k)\big] + (\lambda - k) f(k). \tag{135}$$

In the following, we derive bounds on each of the two terms on the right-hand side of (135), and we start with the first term. Let

$$t(u) \triangleq (u + \alpha_2 - \alpha_1) \exp\left(-\frac{u^2}{\theta\lambda}\right), \quad \forall\, u \in \mathbb{R}$$

then $f(k) = t(k - \alpha_2)$ for every $k \in \mathbb{N}_0$, and by the mean value theorem of calculus

$$\begin{aligned}
f(k+1) - f(k) &= t(k + 1 - \alpha_2) - t(k - \alpha_2) \\
&= t'(c_k) \quad \text{for some } c_k \in [k - \alpha_2, \, k + 1 - \alpha_2] \\
&= \left(1 - \frac{2c_k^2}{\theta\lambda}\right) \exp\left(-\frac{c_k^2}{\theta\lambda}\right) + \left(\frac{2(\alpha_1 - \alpha_2) c_k}{\theta\lambda}\right) \exp\left(-\frac{c_k^2}{\theta\lambda}\right).
\end{aligned} \tag{136}$$



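By referring to the first term on the right-hand side of (136), let

$$p(u) \triangleq (1 - 2u)\, e^{-u}, \quad \forall\, u \ge 0$$

then the global maximum and minimum of $p$ over the non-negative real line are obtained at $u = 0$ and $u = \frac{3}{2}$, respectively, and therefore

$$-2e^{-\frac{3}{2}} \le p(u) \le 1, \quad \forall\, u \ge 0.$$

Let $u = \frac{c_k^2}{\theta\lambda}$, then it follows that the first term on the right-hand side of (136) satisfies the inequality

$$-2e^{-\frac{3}{2}} \le \left(1 - \frac{2c_k^2}{\theta\lambda}\right) \exp\left(-\frac{c_k^2}{\theta\lambda}\right) \le 1. \tag{137}$$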



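Furthermore, by referring to the second term on the right-hand side of (136), let

$$q(u) \triangleq u\, e^{-u^2}, \quad \forall\, u \in \mathbb{R}$$

then the global maximum and minimum of $q$ over the real line are obtained at $u = +\frac{\sqrt{2}}{2}$ and $u = -\frac{\sqrt{2}}{2}$, respectively, and therefore

$$-\frac{1}{2} \sqrt{\frac{2}{e}} \le q(u) \le +\frac{1}{2} \sqrt{\frac{2}{e}}, \quad \forall\, u \in \mathbb{R}.$$

Let this time $u = \frac{c_k}{\sqrt{\theta\lambda}}$, then it follows that the second term on the right-hand side of (136) satisfies

$$\left| \left(\frac{2(\alpha_1 - \alpha_2) c_k}{\theta\lambda}\right) \cdot \exp\left(-\frac{c_k^2}{\theta\lambda}\right) \right| \le \sqrt{\frac{2}{\theta\lambda e}} \cdot |\alpha_1 - \alpha_2|. \tag{138}$$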

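Hence, by combining the equality in (136) with the two inequalities in (137) and (138), it follows that the first term on the right-hand side of (135) satisfies

$$-\left( 2\lambda e^{-\frac{3}{2}} + \sqrt{\frac{2\lambda}{\theta e}} \cdot |\alpha_1 - \alpha_2| \right) \le \lambda \big[f(k+1) - f(k)\big] \le \lambda + \sqrt{\frac{2\lambda}{\theta e}} \cdot |\alpha_1 - \alpha_2|, \quad \forall\, k \in \mathbb{N}_0. \tag{139}$$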
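
The elementary facts used so far, namely the inequality $(1-2x)e^{-x} \ge 1 - 3x$ and the ranges of the functions $p$ and $q$ behind (137) and (138), can be sanity-checked on a grid. The following Python lines are a small sketch; the grid range and step are arbitrary choices, and a grid check is of course only an illustration and not a proof.

```python
import math

us = [i * 1e-3 for i in range(20001)]         # grid on [0, 20]

# p(u) = (1 - 2u) e^{-u} on u >= 0: maximum 1 at u = 0, minimum -2 e^{-3/2} at u = 3/2.
p_vals = [(1.0 - 2.0 * u) * math.exp(-u) for u in us]

# q(u) = u e^{-u^2}: q is odd, so it suffices to scan u >= 0; maximum 1 / sqrt(2e).
q_vals = [u * math.exp(-u * u) for u in us]

# (1 - 2x) e^{-x} >= 1 - 3x for x >= 0 (small tolerance for rounding).
ineq_holds = all(pv >= 1.0 - 3.0 * u - 1e-12 for u, pv in zip(us, p_vals))

if __name__ == "__main__":
    print("max p, min p:", max(p_vals), min(p_vals), -2.0 * math.exp(-1.5))
    print("max q:", max(q_vals), 0.5 * math.sqrt(2.0 / math.e))
    print("inequality holds on the grid:", ineq_holds)
```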

We continue the analysis by a derivation of bounds on the second term of the right-hand side of (135). For the function $f$ in (122), it is equal to

$$\begin{aligned}
& (\lambda - k) f(k) \\
&= (\lambda - k)(k - \alpha_1) \exp\left(-\frac{(k - \alpha_2)^2}{\theta\lambda}\right) \\
&= \big[(\lambda - \alpha_2) + (\alpha_2 - k)\big] \big[(k - \alpha_2) + (\alpha_2 - \alpha_1)\big] \exp\left(-\frac{(k - \alpha_2)^2}{\theta\lambda}\right) \\
&= \big[(\lambda - \alpha_2)(k - \alpha_2) + (\alpha_2 - \alpha_1)(\lambda - \alpha_2) - (k - \alpha_2)^2 + (\alpha_1 - \alpha_2)(k - \alpha_2)\big] \exp\left(-\frac{(k - \alpha_2)^2}{\theta\lambda}\right) \\
&= \big[\sqrt{\theta\lambda}\, (\lambda - \alpha_2)\, v_k - \theta\lambda\, v_k^2 - \sqrt{\theta\lambda}\, (\alpha_2 - \alpha_1)\, v_k + (\alpha_2 - \alpha_1)(\lambda - \alpha_2)\big]\, e^{-v_k^2}, \quad v_k \triangleq \frac{k - \alpha_2}{\sqrt{\theta\lambda}}, \;\; \forall\, k \in \mathbb{N}_0 \\
&= \big(c_0 + c_1 v_k + c_2 v_k^2\big)\, e^{-v_k^2}
\end{aligned} \tag{140}$$

where the coefficients $c_0$, $c_1$ and $c_2$ are introduced in Eqs. (79)-(81), respectively. In order to derive bounds on the left-hand side of (140), let us find the global maximum and minimum of the function $x$ in (77):

$$x(u) \triangleq \big(c_0 + c_1 u + c_2 u^2\big)\, e^{-u^2}, \quad \forall\, u \in \mathbb{R}.$$

Note that $\lim_{u \to \pm\infty} x(u) = 0$ and $x$ is differentiable over the real line, so the global maximum and minimum of $x$ are attained at finite points and their corresponding values are finite. By setting the derivative of $x$ to zero, the candidates for the global maximum and minimum of $x$ over the real line are the real zeros $\{u_i\}$ of the cubic polynomial equation in (78). Note that by their definition in (78), the values of $\{u_i\}$ are independent of the value of $k \in \mathbb{N}_0$, and also the size of the set $\{u_i\}$ is equal to 3 (see Remark 13). Hence, it follows from (140) that

$$\min_{i \in \{1,2,3\}} \{x(u_i)\} \le (\lambda - k) f(k) \le \max_{i \in \{1,2,3\}} \{x(u_i)\}, \quad \forall\, k \in \mathbb{N}_0 \tag{141}$$

where these bounds on the second term on the right-hand side of (135) are independent of the value of $k \in \mathbb{N}_0$.

In order to get bounds on the left-hand side of (135), note that from the bounds on the first and second terms on the right-hand side of (135) (see (139) and (141), respectively), it follows that for every $k \in \mathbb{N}_0$

$$\begin{aligned}
&\min_{i \in \{1,2,3\}} \{x(u_i)\} - \left(2\lambda e^{-\frac{3}{2}} + \sqrt{\frac{2\lambda}{\theta e}} \cdot |\alpha_1-\alpha_2|\right) \\
&\quad \le \lambda f(k+1) - k f(k) \\
&\quad \le \max_{i \in \{1,2,3\}} \{x(u_i)\} + \lambda + \sqrt{\frac{2\lambda}{\theta e}} \cdot |\alpha_1-\alpha_2|
\end{aligned} \tag{142}$$

which yields that the following inequality is satisfied:

$$\sup_{k \in \mathbb{N}_0} \bigl|\lambda f(k+1) - k f(k)\bigr| \le g_\lambda(\alpha_1, \alpha_2, \theta) \tag{143}$$

where the function $g_\lambda$ is introduced in (76). Finally, by combining the inequalities in Eqs. (121), (134) and (143), the lower bound on the total variation distance in (72) follows. The existing upper bound on the total variation distance in (72) was derived in [4, Theorem 1] (see Theorem 1 here). This completes the proof of Theorem 6.

2) Proof of Corollary 2: Corollary 2 follows as a special case of Theorem 6 when the proposed function $f$ in (122) is chosen such that two of its three free parameters (i.e., $\alpha_1$ and $\alpha_2$) are determined sub-optimally, and its third parameter ($\theta$) is determined optimally in terms of the sub-optimal selection of the two other parameters. More explicitly, let $\alpha_1$ and $\alpha_2$ in (122) be set equal to $\lambda$ (i.e., $\alpha_1 = \alpha_2 = \lambda$). From (79)-(81), this setting implies that $c_0 = c_1 = 0$ and $c_2 = -\theta\lambda < 0$ (since $\theta, \lambda > 0$). The cubic polynomial equation in (78), which corresponds to this (possibly sub-optimal) setting of $\alpha_1$ and $\alpha_2$, is

$$2 c_2 u^3 - 2 c_2 u = 0$$

whose zeros are $u = 0, \pm 1$. The function $x$ in (77) therefore gets the form

$$x(u) = c_2 u^2 e^{-u^2}, \qquad \forall\, u \in \mathbb{R}$$

so $x(0) = 0$ and $x(\pm 1) = \frac{c_2}{e} < 0$. It implies that

$$\min_{i \in \{1,2,3\}} x(u_i) = -\frac{\theta\lambda}{e}, \qquad \max_{i \in \{1,2,3\}} x(u_i) = 0,$$

and therefore $h_\lambda$ and $g_\lambda$ in (74) and (76), respectively, are simplified to

$$h_\lambda(\lambda, \lambda, \theta) = \frac{3\lambda+7}{\theta\lambda}, \tag{144}$$

$$g_\lambda(\lambda, \lambda, \theta) = \lambda \max\left\{1,\; 2e^{-\frac{3}{2}} + \theta e^{-1}\right\}. \tag{145}$$

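This sub-optimal setting of $\alpha_1$ and $\alpha_2$ in (122) implies that the coefficient $K_1$ in (73) is replaced with a loosened version

$$K_1'(\lambda) \triangleq \sup_{\theta > 0} \left( \frac{1 - h_\lambda(\lambda, \lambda, \theta)}{2\, g_\lambda(\lambda, \lambda, \theta)} \right). \tag{146}$$

Let $\theta \ge e - 2e^{-\frac{1}{2}}$; then (145) is simplified to $g_\lambda(\lambda, \lambda, \theta) = \lambda \left(2e^{-\frac{3}{2}} + \theta e^{-1}\right)$. It therefore follows from (72), (73) and (144)-(146) that

$$d_{\mathrm{TV}}\bigl(P_W, \mathrm{Po}(\lambda)\bigr) \ge \widetilde{K}_1(\lambda) \sum_{i=1}^n p_i^2 \tag{147}$$

where

$$\widetilde{K}_1(\lambda) \triangleq \sup_{\theta \ge e - 2e^{-1/2}} \left( \frac{1 - \frac{3\lambda+7}{\lambda\theta}}{2\lambda \left(2e^{-\frac{3}{2}} + \theta e^{-1}\right)} \right) \tag{148}$$

and, in general, $K_1'(\lambda) \ge \widetilde{K}_1(\lambda)$ due to the above restricted constraint on $\theta$ (see (146) versus (148)). By differentiating the function inside the supremum w.r.t. $\theta$ and setting its derivative to zero, one gets the following quadratic equation in $\theta$:

$$\lambda \theta^2 - 2(3\lambda+7)\,\theta - 2(3\lambda+7)\, e^{-\frac{1}{2}} = 0$$

whose positive solution is the optimized value of $\theta$ in (85). Furthermore, it is clear that this value of $\theta$ in (85) is larger than, e.g., 3, so it satisfies the constraint in (148). This completes the proof of Corollary 2.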
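As an illustration of the optimization step above, the following sketch evaluates the positive root of the quadratic equation and compares the resulting coefficient with a brute-force grid search over $\theta$. The function names `theta_star`, `objective` and `k1_tilde`, the grid resolution and the tested values of $\lambda$ are assumptions introduced here for illustration only.

```python
import math

def theta_star(lam):
    """Positive root of lam*t^2 - 2*(3*lam+7)*t - 2*(3*lam+7)*exp(-1/2) = 0."""
    a = 3.0 * lam + 7.0
    return (a + math.sqrt(a * (a + 2.0 * lam * math.exp(-0.5)))) / lam

def objective(lam, theta):
    """The function inside the supremum in (148)."""
    numerator = 1.0 - (3.0 * lam + 7.0) / (theta * lam)
    denominator = 2.0 * lam * (2.0 * math.exp(-1.5) + theta / math.e)
    return numerator / denominator

def k1_tilde(lam):
    """Closed-form value of the supremum, attained at theta_star(lam)."""
    return objective(lam, theta_star(lam))

theta_min = math.e - 2.0 * math.exp(-0.5)   # constraint on theta in (148)
for lam in (0.1, 1.0, 10.0):
    # Brute-force search over an assumed grid of theta values.
    grid = (theta_min + 0.01 * j for j in range(1, 100001))
    best_on_grid = max(objective(lam, th) for th in grid)
    print(lam, theta_star(lam), k1_tilde(lam), best_on_grid)
```

For each tested $\lambda$ the closed-form value is at least as large as the best grid value, and the optimizer exceeds 3, so the constraint in (148) is inactive.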

3) Discussion on the Connections of Theorem 6 and Corollary 2 to [4, Theorem 2]: As was demonstrated in the previous sub-section, Theorem 6 implies the lower bound on the total variation distance in Corollary 2. In the following, it is proved that Corollary 2 implies the lower bound on the total variation distance in [4, Theorem 2] (see also Theorem 1 here), and the improvement in the tightness of the lower bound in Corollary 2 is explicitly quantified in the two extreme cases where $\lambda \to 0$ and $\lambda \to \infty$. The observation that Corollary 2 provides a tightened lower bound, as compared to [4, Theorem 2], is justified by the fact that the lower bound in (147) with the coefficient $\widetilde{K}_1(\lambda)$ in (148) was loosened in the proof of [4, Theorem 2] by a sub-optimal selection of the parameter $\theta$, which leads to a lower bound on $\widetilde{K}_1(\lambda)$ (the sub-optimal selection of $\theta$ in the proof of [4, Theorem 2] is $\theta = 21 \max\{1, \frac{1}{\lambda}\}$). On the other hand, the optimized value of $\theta$ that is used in (85) provides an exact closed-form expression for $\widetilde{K}_1(\lambda)$ in (148), and it leads to the derivation of the bound in Corollary 2. This justifies the observation that the lower bound on the total variation distance in Corollary 2 implies the original lower bound in [4, Theorem 2].

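From [4, Theorems 1 and 2], the ratio between the upper and lower bounds on the total variation distance (these bounds also appear in Theorem 1 here) is equal to 32 in the two extreme cases where $\lambda \to 0$ or $\lambda \to \infty$. In order to quantify the improvement that is obtained by Corollary 2 (which follows from the optimal selection of the parameter $\theta$), we calculate in the following the ratio of the same upper bound and the new lower bound in this corollary in these two extreme cases. In the limit where $\lambda$ tends to infinity, this ratio tends to

$$\begin{aligned}
&\lim_{\lambda \to \infty} \frac{\left(\frac{1-e^{-\lambda}}{\lambda}\right) \sum_{i=1}^n p_i^2}{\left(\frac{1 - \frac{3\lambda+7}{\lambda\theta}}{2\lambda \left(2e^{-3/2} + \theta e^{-1}\right)}\right) \sum_{i=1}^n p_i^2} \qquad \bigl(\theta = \theta(\lambda) \text{ is given in Eq. (85)}\bigr) \\
&= 2 \lim_{\lambda \to \infty} \frac{2e^{-3/2} + \theta e^{-1}}{1 - \frac{3\lambda+7}{\lambda\theta}} \\
&= \frac{2}{e} \lim_{\lambda \to \infty} \frac{\theta \left(2e^{-1/2} + \theta\right)}{\theta - \left(3 + \frac{7}{\lambda}\right)} \\
&\stackrel{(a)}{=} \frac{2 \left(3 + \sqrt{3(3+2e^{-1/2})}\right) \left(3 + 2e^{-1/2} + \sqrt{3(3+2e^{-1/2})}\right)}{e \sqrt{3(3+2e^{-1/2})}} \\
&\approx 10.539
\end{aligned} \tag{149}$$

where equality (a) holds since, from (85), $\lim_{\lambda \to \infty} \theta = 3 + \sqrt{3(3+2e^{-1/2})}$. Furthermore, the limit of this ratio when $\lambda$ tends to zero is equal to

$$\begin{aligned}
2 \lim_{\lambda \to 0} \frac{\left(1-e^{-\lambda}\right) \left(2e^{-3/2} + \theta e^{-1}\right)}{1 - \frac{3\lambda+7}{\lambda\theta}}
&\stackrel{(a)}{=} \frac{28}{e} \lim_{\lambda \to 0} \frac{2e^{-1/2} + \theta}{\theta \left(1 - \frac{3\lambda+7}{\lambda\theta}\right)} \\
&\stackrel{(b)}{=} \frac{56}{e} \approx 20.601
\end{aligned} \tag{150}$$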

where equalities (a) and (b) hold since, from (85), it follows that $\lim_{\lambda \to 0} (\lambda\theta) = 14$. Note that the two limits in (149) and (150) are indeed consistent with the limits of the upper curve in Figure 1 (see p. 20). This implies that Corollary 2 improves the original lower bound on the total variation distance in [4, Theorem 2] by a factor of $\frac{32}{10.539} \approx 3.036$ in the limit where $\lambda \to \infty$, and it improves it by a factor of $\frac{32}{20.601} \approx 1.553$ in the other extreme case where $\lambda \to 0$, while the lower bound in Corollary 2 is still expressed in closed form. The fact that this improvement stems solely from the optimal choice of the free parameter $\theta$, versus its sub-optimal choice in the proof of [4, Theorem 2], shows the sensitivity of the resulting lower bound to the selection of $\theta$. This observation in fact motivated us to further improve the lower bound on the total variation distance in Theorem 6 by introducing the two additional parameters $\alpha_1, \alpha_2 \in \mathbb{R}$ of the proposed function $f$ in (122) (which, in the proof of Corollary 2 in the previous sub-section, are both set equal to $\lambda$). The further improvement in the lower bound, at the expense of a feasible increase in computational complexity, is shown in the plot of Figure 1 (by comparing the upper and lower curves of this plot, which correspond to the ratio of the upper bound in [4, Theorem 1] and the new lower bounds in Corollary 2 and Theorem 6, respectively).

It is interesting to note, however, that no improvement is obtained in Theorem 6 as compared to Corollary 2 for $\lambda \ge 20$, as is shown in Figure 1 (since the upper and lower curves in this plot merge for $\lambda \ge 20$, and their common limit in the extreme case where $\lambda \to \infty$ is given in (149)). This implies that the two new lower bounds in Theorem 6 and Corollary 2 coincide for these values of $\lambda$; for this range of values of $\lambda$, the lower bound on the total variation distance in Corollary 2 has the advantage of being expressed in closed form (i.e., there is no need for a numerical optimization of this bound).

In light of the above discussion, another important motivation for improving the lower bound on the total variation distance in Theorem 6 and Corollary 2 is that the factors of improvement obtained by these lower bounds (as compared to the original bound) are squared, according to Pinsker's inequality, when one wishes to derive lower bounds on the relative entropy. This improvement becomes significant in many inequalities in information theory and statistics where the relative entropy appears in the error exponent (as is exemplified in Section III-E). Finally, it is noted that this discussion, which partially motivates our paper, appears in a sub-section that refers to proofs (of the second half of this work) because it follows directly from the proofs of Theorem 6 and Corollary 2.
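The two limits in (149) and (150) can be reproduced numerically. The sketch below is an illustration under stated assumptions: it takes the upper bound coefficient to be $(1-e^{-\lambda})/\lambda$ (as it appears in the ratio in (149)), uses the positive root of the quadratic equation from the proof of Corollary 2 for $\theta$, and the function names and the test values $\lambda = 10^{6}$ and $\lambda = 10^{-6}$ are hypothetical choices.

```python
import math

def theta_star(lam):
    """Positive root of lam*t^2 - 2*(3*lam+7)*t - 2*(3*lam+7)*exp(-1/2) = 0."""
    a = 3.0 * lam + 7.0
    return (a + math.sqrt(a * (a + 2.0 * lam * math.exp(-0.5)))) / lam

def bound_ratio(lam):
    """Ratio of the upper-bound coefficient to the lower-bound coefficient of Corollary 2."""
    t = theta_star(lam)
    upper = -math.expm1(-lam) / lam                      # (1 - e^{-lam}) / lam
    lower = (1.0 - (3.0 * lam + 7.0) / (t * lam)) / (
        2.0 * lam * (2.0 * math.exp(-1.5) + t / math.e)
    )
    return upper / lower

# Closed-form limits from (149) and (150).
s = math.sqrt(3.0 * (3.0 + 2.0 * math.exp(-0.5)))
limit_inf = (2.0 / math.e) * (3.0 + s) * (3.0 + 2.0 * math.exp(-0.5) + s) / s
limit_zero = 56.0 / math.e

print(bound_ratio(1e6), limit_inf)     # large-lambda regime
print(bound_ratio(1e-6), limit_zero)   # small-lambda regime
```

Both printed pairs agree to several decimal places, which is consistent with the values 10.539 and 20.601 quoted above.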

4) Proof of Theorem 7: In the following we prove Theorem 7 by obtaining a lower bound on the relative entropy between the distribution $P_W$ of a sum of independent Bernoulli random variables $\{X_i\}_{i=1}^n$ with $X_i \sim \mathrm{Bern}(p_i)$ and the Poisson distribution $\mathrm{Po}(\lambda)$ with mean $\lambda = \sum_{i=1}^n p_i$. A first lower bound on the relative entropy follows from a combination of Pinsker's inequality (see Eq. (68)) with the lower bound on the total variation distance between these distributions (see Theorem 6). The combination of the two gives that

$$D\bigl(P_W \,\|\, \mathrm{Po}(\lambda)\bigr) \ge 2 \bigl(K_1(\lambda)\bigr)^2 \left( \sum_{i=1}^n p_i^2 \right)^2. \tag{151}$$

This lower bound can be tightened via the distribution-dependent refinement of Pinsker's inequality in [38], which is introduced briefly in Section III-A. Following the technique of this refinement, let $Q \triangleq \Pi_\lambda$ be the probability mass function that corresponds to the Poisson distribution $\mathrm{Po}(\lambda)$, i.e.,

$$Q(k) = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad \forall\, k \in \mathbb{N}_0.$$

If $\lambda < \log 2$ then $Q(0) = e^{-\lambda} > \frac{1}{2}$. Hence, from (70), the maximization of $\min\{Q(A), 1 - Q(A)\}$ over all the subsets $A \subseteq \mathbb{N}_0$ is obtained for $A = \{0\}$ (or, symmetrically, for $A = \mathbb{N}_0 \setminus \{0\} = \mathbb{N}$), which implies that, if $\lambda < \log 2$, one gets from Eqs. (69), (70) and (72) that

$$D\bigl(P_W \,\|\, \mathrm{Po}(\lambda)\bigr) \ge \varphi(\Pi_Q) \bigl(K_1(\lambda)\bigr)^2 \left( \sum_{i=1}^n p_i^2 \right)^2 \tag{152}$$

where

$$\Pi_Q = \min\{e^{-\lambda},\, 1 - e^{-\lambda}\} = 1 - e^{-\lambda} \tag{153}$$

and, since $\Pi_Q < \frac{1}{2}$, then

$$\varphi(\Pi_Q) = \frac{1}{1 - 2\Pi_Q} \, \log\left( \frac{1 - \Pi_Q}{\Pi_Q} \right). \tag{154}$$

Hence, the combination of (151), (152) and (154) gives the lower bound on the relative entropy in Theorem 7 (see Eqs. (86), (87) and (88) in this theorem). The upper bound on the considered relative entropy is a known result (see [33, Theorem 1]), which is cited here in order to have both upper and lower bounds in the same inequality (see Eq. (86)). This completes the proof of Theorem 7.
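To illustrate how much the refinement in (152)-(154) gains over the factor 2 in (151), the following sketch evaluates the multiplicative factor as a function of $\lambda$. The names `phi` and `refinement_factor` are hypothetical, and the numerical guard near $p = \frac{1}{2}$ is an implementation choice made here, not part of the paper.

```python
import math

def phi(p):
    """phi(p) = log((1 - p) / p) / (1 - 2p) for 0 < p < 1/2, as in (154)."""
    if abs(1.0 - 2.0 * p) < 1e-9:
        return 2.0                      # limiting value as p -> 1/2
    return math.log((1.0 - p) / p) / (1.0 - 2.0 * p)

def refinement_factor(lam):
    """Factor multiplying (K_1(lam))^2 * (sum of p_i^2)^2 in the lower bound.

    For lam < log 2 it is phi(Pi_Q) with Pi_Q = 1 - exp(-lam), see (152)-(154);
    otherwise it falls back to the factor 2 of Pinsker's inequality in (151).
    """
    if lam >= math.log(2.0):
        return 2.0
    pi_q = -math.expm1(-lam)            # 1 - e^{-lam}, computed accurately
    return phi(pi_q)

for lam in (0.01, 0.1, 0.5, 1.0):
    print(lam, refinement_factor(lam))
```

The factor exceeds 2 for small $\lambda$ and decreases towards 2 as $\lambda$ approaches $\log 2$, which is where the refinement stops helping.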

5) Proof of Proposition 4: We start by proving the tightened upper and lower bounds on the Hellinger distance in terms of the total variation distance and relative entropy between the two considered distributions. These refined bounds in (90) improve the original bounds in (66). It is noted that the left-hand side of (66) is proved in [39, p. 99], and the right-hand side is proved in [39, p. 328]. The following is the proof of the refined bounds on the Hellinger distance in (90).

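Let us start with the proof of the left-hand side of (90). To this end, let $P$ and $Q$ be two probability mass functions that are defined on the same set $\mathcal{X}$. From the definitions of the total variation distance and the Hellinger distance, and from the Cauchy-Schwarz inequality,

$$\begin{aligned}
d_{\mathrm{TV}}(P, Q) &= \frac{1}{2} \sum_{x \in \mathcal{X}} \bigl| P(x) - Q(x) \bigr| \\
&= \frac{1}{2} \sum_{x \in \mathcal{X}} \Bigl| \sqrt{P(x)} - \sqrt{Q(x)} \Bigr| \Bigl( \sqrt{P(x)} + \sqrt{Q(x)} \Bigr) \\
&\le \frac{1}{2} \left( \sum_{x \in \mathcal{X}} \Bigl( \sqrt{P(x)} - \sqrt{Q(x)} \Bigr)^2 \right)^{\frac{1}{2}} \left( \sum_{x \in \mathcal{X}} \Bigl( \sqrt{P(x)} + \sqrt{Q(x)} \Bigr)^2 \right)^{\frac{1}{2}} \\
&= d_{\mathrm{H}}(P, Q) \left( 1 + \sum_{x \in \mathcal{X}} \sqrt{P(x)\, Q(x)} \right)^{\frac{1}{2}} \\
&= d_{\mathrm{H}}(P, Q) \Bigl( 2 - \bigl(d_{\mathrm{H}}(P, Q)\bigr)^2 \Bigr)^{\frac{1}{2}}.
\end{aligned} \tag{155}$$

Let $c \triangleq \bigl(d_{\mathrm{TV}}(P, Q)\bigr)^2$ and $x \triangleq \bigl(d_{\mathrm{H}}(P, Q)\bigr)^2$; then it follows by squaring both sides of (155) that $x(2 - x) \ge c$, which therefore implies that

$$1 - \sqrt{1 - c} \le x \le 1 + \sqrt{1 - c}. \tag{156}$$

The right-hand side of (156) is satisfied automatically since $0 \le d_{\mathrm{H}}(P, Q) \le 1$ implies that $x \le 1$. The left-hand side of (156) gives the lower bound on the left-hand side of (90). Next, we prove the upper bound on the right-hand side of (90). By Jensen's inequality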



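$$\begin{aligned}
\bigl(d_{\mathrm{H}}(P, Q)\bigr)^2 &= \frac{1}{2} \sum_{x \in \mathcal{X}} \Bigl( \sqrt{P(x)} - \sqrt{Q(x)} \Bigr)^2 \\
&= 1 - \sum_{x \in \mathcal{X}} \sqrt{P(x)\, Q(x)} \\
&= 1 - \sum_{x \in \mathcal{X}} P(x) \sqrt{\frac{Q(x)}{P(x)}} \\
&= 1 - \sum_{x \in \mathcal{X}} P(x)\, e^{\frac{1}{2} \log\left(\frac{Q(x)}{P(x)}\right)} \\
&\le 1 - e^{\frac{1}{2} \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{Q(x)}{P(x)}\right)} \\
&= 1 - e^{-\frac{1}{2} D(P \| Q)}
\end{aligned} \tag{157}$$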

which completes the proof of (90). The other bound on the Bhattacharyya parameter in (91) follows from (90) and the simple relation in (63) between the Bhattacharyya parameter and the Hellinger distance. This completes the proof of Proposition 4.
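The refined bounds proved above, together with the weaker bounds that they imply, can be sanity-checked on randomly drawn distributions. The sketch below is only an illustration: the helper names, the alphabet size and the random seed are assumptions, and the Hellinger distance is taken with the normalization $\bigl(d_{\mathrm{H}}\bigr)^2 = \frac{1}{2} \sum_x \bigl(\sqrt{P(x)} - \sqrt{Q(x)}\bigr)^2$ used in (157).

```python
import math
import random

def total_variation(P, Q):
    """d_TV(P, Q) = (1/2) * sum |P(x) - Q(x)|."""
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def hellinger_sq(P, Q):
    """Squared Hellinger distance, (1/2) * sum (sqrt(P) - sqrt(Q))^2."""
    return 0.5 * sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(P, Q))

def relative_entropy(P, Q):
    """D(P || Q) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0.0)

def random_pmf(size, rng):
    """Hypothetical helper: a random strictly positive pmf on `size` points."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def bounds_chain(P, Q):
    """Returns (weak lower, refined lower, d_H^2, refined upper, weak upper)."""
    d_tv = total_variation(P, Q)
    d = relative_entropy(P, Q)
    return (
        0.5 * d_tv ** 2,                      # weaker lower bound
        1.0 - math.sqrt(1.0 - d_tv ** 2),     # refined lower bound
        hellinger_sq(P, Q),
        -math.expm1(-0.5 * d),                # refined upper bound, 1 - exp(-D/2)
        0.5 * d,                              # weaker upper bound
    )

rng = random.Random(0)
P, Q = random_pmf(6, rng), random_pmf(6, rng)
print(bounds_chain(P, Q))
```

The five returned numbers should be non-decreasing from left to right for any pair of strictly positive distributions.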

Remark 24: The weaker bounds in (66), proved in [39], follow from their refined version in (90) by using the simple inequalities

$$\sqrt{1 - x} \le 1 - \frac{x}{2}, \qquad \forall\, x \in [0, 1]$$

and

$$e^{-x} \ge 1 - x, \qquad \forall\, x \ge 0.$$

6) Proof of Corollary 3: This corollary is a direct consequence of Theorems 6 and 7 and Proposition 4.

7) Proof of Corollary 4: Under the conditions in Corollary 4, the asymptotic scaling of the total variation distance, relative entropy, Hellinger distance and Bhattacharyya parameter follows from their (upper and lower) bounds in Theorems 6 and 7 and Eqs. (94) and (95), respectively. This completes the proof of Corollary 4.

8) Proof of Proposition 5: Let $P$ and $Q$ be two arbitrary probability mass functions that are defined on the same set $\mathcal{X}$. We derive in the following the lower bound on the Chernoff information in terms of the total variation distance between $P$ and $Q$, as is stated in (100):

$$\begin{aligned}
C(P, Q) &\stackrel{(a)}{\ge} -\log \left( \sum_{x \in \mathcal{X}} \sqrt{P(x)\, Q(x)} \right) \\
&\stackrel{(b)}{=} -\log \mathrm{BC}(P, Q) \\
&\stackrel{(c)}{=} -\log \Bigl( 1 - \bigl(d_{\mathrm{H}}(P, Q)\bigr)^2 \Bigr) \\
&\stackrel{(d)}{\ge} -\frac{1}{2} \log \Bigl( 1 - \bigl(d_{\mathrm{TV}}(P, Q)\bigr)^2 \Bigr)
\end{aligned}$$

where inequality (a) follows by selecting the possibly sub-optimal choice $\theta = \frac{1}{2}$ in (64), equality (b) holds by the definition of the Bhattacharyya parameter (see (62)), equality (c) follows from the equality in (63) that relates the Hellinger distance and the Bhattacharyya parameter, and inequality (d) follows from the lower bound on the Hellinger distance in terms of the total variation distance (see (90)). This completes the proof of Proposition 5.
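The chain of inequalities (a)-(d) can be checked numerically as follows. This is a sketch under assumptions: the Chernoff information is taken in its standard form $C(P,Q) = -\min_{\theta \in [0,1]} \log \sum_x P(x)^{\theta} Q(x)^{1-\theta}$ (assumed here to coincide with (64)), the minimization is replaced by a grid search that contains $\theta = \frac{1}{2}$, and the helper names and example distributions are hypothetical.

```python
import math

def total_variation(P, Q):
    """d_TV(P, Q) = (1/2) * sum |P(x) - Q(x)|."""
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def bhattacharyya(P, Q):
    """BC(P, Q) = sum sqrt(P(x) * Q(x))."""
    return sum(math.sqrt(p * q) for p, q in zip(P, Q))

def chernoff_information(P, Q, points=2001):
    """Grid approximation of -min_{0 <= t <= 1} log sum P(x)^t * Q(x)^(1 - t).

    The grid contains t = 1/2, and a grid minimum can only overestimate the
    true minimum, so the returned value never exceeds the exact C(P, Q).
    Assumes strictly positive P and Q.
    """
    best = min(
        math.log(sum(p ** t * q ** (1.0 - t) for p, q in zip(P, Q)))
        for t in (i / (points - 1) for i in range(points))
    )
    return -best

# Assumed example distributions on a four-letter alphabet.
P = [0.5, 0.2, 0.2, 0.1]
Q = [0.1, 0.3, 0.2, 0.4]

c_info = chernoff_information(P, Q)
via_bc = -math.log(bhattacharyya(P, Q))
via_tv = -0.5 * math.log1p(-total_variation(P, Q) ** 2)
print(c_info, via_bc, via_tv)
```

The three printed values should be non-increasing from left to right, matching inequality (a) and the combination of steps (b)-(d).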

9) Proof of Corollary 5: This corollary is a direct consequence of the lower bound on the total variation distance in Theorem 6 and the lower bound on the Chernoff information in terms of the total variation distance in Proposition 5.